xml - Extracting text from specific tags in HTML using Mathematica

For a page with the following HTML structure:

          <tr class="">
            <td class="number">1</td>
            <td class="name"><a href="..." >jack green</a></td>
            <td class="score-cell ">
              <span class="display">98
                <span class="tooltip column1"></span>
              </span>
            </td>
            <td class="score-cell ">
              ...
            </td>
          ...
          <tr class="">
            <td class="number">2</td>
            <td class="name"><a href="..." target="_top">nicole smith</a></td>
            <td class="score-cell ">
             ...
            </td>

How can I extract the text in the name tags and end up with the list {jack green, nicole smith}? I am hoping for an elegant method.

input =
  "          <tr class=\"\">
              <td class=\"number\">1</td>
              <td class=\"name\"><a href=\"...\" >jack green</a></td>
              <td class=\"score-cell \">
                <span class=\"display\">98
                  <span class=\"tooltip column1\"></span>
                </span>
              </td>
              <td class=\"score-cell \">
                ...
              </td>
            ...
            <tr class=\"\">
              <td class=\"number\">2</td>
              <td class=\"name\"><a href=\"...\" target=\"_top\">nicole smith</a></td>
              <td class=\"score-cell \">
               ...
              </td>";

(* Eliminate unnecessary whitespace and add a start character *)
html = StringJoin["x", StringReplace[StringTrim[input],
   {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];

(* Find the tags and the positions of the tags containing 'name' *)
tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];

(* Split on the tags and extract at the name tag positions; the offset of 2
   skips the start-character piece and the empty piece between <td> and <a> *)
splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
Extract[splits, nametagpositions + 2]

{jack green, nicole smith}
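
If the name is always wrapped in an `<a>` element directly inside `<td class="name">`, as it is in the question's markup, a single string pattern may be enough. The following is only a sketch under that assumption, not a general HTML parser; `sample` and `names` are illustrative names, and the closing `</tr>` tags in the sample string are added here for completeness.

```mathematica
(* Hypothetical sample mirroring the question's markup *)
sample = "<tr class=\"\">
  <td class=\"number\">1</td>
  <td class=\"name\"><a href=\"...\" >jack green</a></td>
</tr>
<tr class=\"\">
  <td class=\"number\">2</td>
  <td class=\"name\"><a href=\"...\" target=\"_top\">nicole smith</a></td>
</tr>";

(* Match the opening name cell and anchor tag, skip any attributes up to ">",
   then capture everything up to the closing </a> *)
names = StringCases[sample,
  "<td class=\"name\"><a" ~~ (Except[">"] ...) ~~ ">" ~~
    name : (Except["<"] ..) ~~ "</a>" :> name]
```

For this sample, `names` evaluates to the list {jack green, nicole smith}. The pattern depends on the exact spelling of the `class` attribute, so markup that varies in spacing or attribute order would need a looser pattern.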

Note

The start character is required to guarantee a correct split. As the demonstration below shows, the initial splits between "a" characters are not counted until there is a substring to report. With a start character in place, the positions of the required items can be used reliably.

html = "aa1aaa2aa"; splits = StringSplit[html, "a"]

{1, , ,2}

html = "aaaaaaa1aaa2aaaaaaa"; splits = StringSplit[html, "a"]

{1, , ,2}

html = "0aaaaaaa1aaa2aaaaaaa"; splits = StringSplit[html, "a"]

{0, , , , , , ,1, , ,2}
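
`StringSplit` also accepts `All` as a third argument, which keeps the zero-length substrings at the beginning and end instead of dropping them. The sketch below compares the two forms on the first test string; `dropped` and `kept` are illustrative names. Whether this can stand in for the start character in the tag-splitting code above is an assumption to check against your own data.

```mathematica
(* Default behavior: empty strings at the ends are dropped *)
dropped = StringSplit["aa1aaa2aa", "a"]

(* With All: empty strings at both ends are kept, so every delimiter counts *)
kept = StringSplit["aa1aaa2aa", "a", All]
```

Here `dropped` has 4 elements and `kept` has 8, one more than the seven "a" delimiters in the string.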
