Extracting text from specific tags in HTML using Mathematica
For a page with this HTML structure:
    <tr class="">
      <td class="number">1</td>
      <td class="name"><a href="..." >jack green</a></td>
      <td class="score-cell ">
        <span class="display">98 <span class="tooltip column1"></span> </span>
      </td>
      <td class="score-cell "> ... </td>
      ...
    <tr class="">
      <td class="number">2</td>
      <td class="name"><a href="..." target="_top">nicole smith</a></td>
      <td class="score-cell "> ... </td>
How can I extract the text inside the name tags and end up with the list {jack green, nicole smith}? I hope there is an elegant method.
    input = " <tr class=\"\"> <td class=\"number\">1</td> <td class=\"name\"><a href=\"...\" >jack green</a></td> <td class=\"score-cell \"> <span class=\"display\">98 <span class=\"tooltip column1\"></span> </span> </td> <td class=\"score-cell \"> ... </td> ... <tr class=\"\"> <td class=\"number\">2</td> <td class=\"name\"><a href=\"...\" target=\"_top\">nicole smith</a></td> <td class=\"score-cell \"> ... </td>";

    (* eliminate unnecessary whitespace and add a start character *)
    html = StringJoin["x", StringReplace[StringTrim[input],
        {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];

    (* find the tags and the positions of the tags containing 'name' *)
    tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
    nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];

    (* split on the tags; the wanted text is two splits after each name tag *)
    splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
    Extract[splits, nametagpositions + 2]
{jack green, nicole smith}
Note

The start character is required to guarantee a correct split. As you can see in the demonstration below, the initial splits between "a" characters are not counted until there is a substring to report. With a start character, the positions of the required items can be used reliably.
    html = "aa1aaa2aa"; splits = StringSplit[html, "a"]
{1, , ,2}
    html = "aaaaaaa1aaa2aaaaaaa"; splits = StringSplit[html, "a"]
{1, , ,2}
    html = "0aaaaaaa1aaa2aaaaaaa"; splits = StringSplit[html, "a"]
{0, , , , , , ,1, , ,2}
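
For comparison, here is a shorter sketch that skips the split-and-offset bookkeeping and lets StringCases capture the link text directly. It is not part of the answer above. It assumes every name cell has exactly the form <td class="name"><a ...>text</a> seen in the sample, and the variable names sample and names are made up for illustration; a page with different attribute order or nested markup would need a different pattern.

```mathematica
(* Shortened sample; assumed to follow the structure shown in the question *)
sample = "<td class=\"number\">1</td> <td class=\"name\"><a href=\"...\" >jack green</a></td> <td class=\"number\">2</td> <td class=\"name\"><a href=\"...\" target=\"_top\">nicole smith</a></td>";

(* Skip the attributes of <a ...>, then capture everything up to </a> *)
names = StringCases[sample,
   "<td class=\"name\"><a" ~~ Except[">"] ... ~~ ">" ~~
     name : (Except["<"] ..) ~~ "</a>" :> name]
```

{jack green, nicole smith}

Because the pattern anchors on the name cell itself, no start character is needed here; the trade-off is that it depends more heavily on the exact tag layout.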