How to extract nested tables from HTML?
09-09-2019
Question
I have an HTML file (encoded in UTF-8). I open it with codecs.open(). The file structure is:
<html>
// header
<body>
// some text
<table>
// some rows with cells here
// some cells contain tables
</table>
// maybe some text here
<table>
// a form and other stuff
</table>
// probably some more text
</body></html>
I need to retrieve only the first table (and discard the one with the form), omitting all input before the first <table> and after the corresponding </table>. Some cells also contain paragraphs, bold text and scripts. There is no more than one nested table per row of the main table.
How can I extract it to get a list of rows, where each element holds a cell's plain-text data (as a unicode string) and a list of rows for each nested table? There is no more than one level of nesting.
I tried HTMLParse, PyParse and the re module, but I can't get this working. I'm quite new to Python.
Solution
Try Beautiful Soup.
In principle you need to use a real parser (which Beautiful Soup is). Regular expressions cannot deal with arbitrarily nested elements, for computer-sciencey reasons (finite state machines can't parse context-free grammars, IIRC).
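As a rough illustration of the Beautiful Soup suggestion, here is a minimal sketch. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend. SAMPLE, table_rows and parse_table are made-up names for this example, and the sample markup only imitates the structure described in the question. Script contents are dropped from the cell text, which is a guess at what "plain" cell data should mean.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the structure described in the question.
SAMPLE = """
<html><body>
<p>intro</p>
<table>
  <tr><td>a</td><td><b>b</b><script>var x = 1;</script></td></tr>
  <tr><td>c</td><td><table><tr><td>n1</td><td>n2</td></tr></table></td></tr>
</table>
<table><tr><td><form></form></td></tr></table>
</body></html>
"""

def table_rows(table):
    """Return the direct rows of a table, with or without a <tbody>."""
    body = table.find("tbody", recursive=False)
    if body is None:
        body = table
    return body.find_all("tr", recursive=False)

def parse_table(table):
    """Convert one <table> into a list of rows, each a list of cells."""
    rows = []
    for tr in table_rows(table):
        cells = []
        for cell in tr.find_all(["td", "th"], recursive=False):
            nested = cell.find("table")
            if nested is not None:
                # A cell holding a nested table becomes a list of its rows.
                cells.append(parse_table(nested))
            else:
                # Remove <script> elements so their code is not in the text.
                for script in cell.find_all("script"):
                    script.decompose()
                cells.append(cell.get_text(strip=True))
        rows.append(cells)
    return rows

soup = BeautifulSoup(SAMPLE, "html.parser")
# find() returns the first <table> in document order, i.e. the outer one.
data = parse_table(soup.find("table"))
print(data)  # [['a', 'b'], ['c', [['n1', 'n2']]]]
```

Using recursive=False keeps the rows and cells of a nested table from being mixed into the outer table's rows. For a real file, the sample string would be replaced by the file's contents.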
OTHER TIPS
You may like lxml. I'm not sure I really understood what you want to do with that structure, but maybe this example will help...
import lxml.html

def process_row(row):
    # Yield one item per cell: its text, or a list of nested tables.
    for cell in row.xpath('./td'):
        inner_tables = cell.xpath('./table')
        if len(inner_tables) < 1:
            yield cell.text_content()
        else:
            yield [process_table(t) for t in inner_tables]

def process_table(table):
    # list() consumes the generator so that each row is a real list.
    return [list(process_row(row)) for row in table.xpath('./tr')]

html = lxml.html.parse('test.html')
# First <table> that is a direct child of <body>.
first_table = html.xpath('//body/table[1]')[0]
data = process_table(first_table)
If the HTML is well-formed you can parse it into a DOM tree and use XPath to extract the table you want. I usually use lxml for parsing XML, and it can parse HTML as well.
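The XPath for pulling out the first table would be "(//table)[1]". Note that "//table[1]" without the parentheses matches every table that is the first table within its parent, so it also returns nested tables.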
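To make the behaviour of that XPath concrete, here is a small sketch, assuming lxml is installed. The sample markup and the id attributes are invented for the illustration. It compares "//table[1]", which selects every table that is the first table child of its parent, with "(//table)[1]", which selects only the first table in the whole document.

```python
import lxml.html

# Hypothetical sample: an outer table with one nested table, then a second table.
SAMPLE = """
<html><body>
<table id="outer">
  <tr><td><table id="inner"><tr><td>x</td></tr></table></td></tr>
</table>
<table id="form"><tr><td>form</td></tr></table>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE)

# Matches each table that is the first <table> child of its own parent.
per_parent = [t.get("id") for t in doc.xpath("//table[1]")]

# Parentheses apply the index to the whole result set, in document order.
document_first = [t.get("id") for t in doc.xpath("(//table)[1]")]

print(per_parent)      # ['outer', 'inner']
print(document_first)  # ['outer']
```

Taking element [0] of the first result still gives the outer table, because results come back in document order, but the parenthesised form states the intent directly.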