Question

I am looking for a way to cleanly convert HTML tables to readable plain text.

For example, given the input:

<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>

I expect the output:

Height: 200
Width: 440

I would prefer not to use external tools such as w3m -dump file.html, because (1) they are platform-dependent, (2) I want some control over the process, and (3) I assume it is doable with Python alone, with or without extra modules.

I don't need any word-wrapping or adjustable cell separator width. Having tabs as cell separators would be good enough.

Update

This is an old question for an old use case. Given that pandas now provides the read_html function, my current answer would definitely be pandas-based.

Solution

How about using the approach from this question:

Parse HTML table to Python list?

However, use collections.OrderedDict() instead of a plain dictionary to preserve the column order. Once you have the dictionary, it is very easy to extract and format the text.

Based on @Colt 45's solution:

import xml.etree.ElementTree
import collections

s = """\
<table>
    <tr>
        <th>Height</th>
        <th>Width</th>
        <th>Depth</th>
    </tr>
    <tr>
        <td>10</td>
        <td>12</td>
        <td>5</td>
    </tr>
    <tr>
        <td>0</td>
        <td>3</td>
        <td>678</td>
    </tr>
    <tr>
        <td>5</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>
"""

table = xml.etree.ElementTree.XML(s)
rows = iter(table)
# The first row holds the column headers.
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    # Pair each header with the value in the same column, preserving order.
    for key, value in collections.OrderedDict(zip(headers, values)).items():
        print(key, value)

Output:

Height 10
Width 12
Depth 5
Height 0
Width 3
Depth 678
Height 5
Width 3
Depth 4
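
For the table in the question, which has no header row, the same ElementTree approach can print one line per row with tabs between the cells. This is only a minimal sketch: the helper name table_to_text is made up for illustration, and it assumes the markup is well-formed XML (ElementTree raises a ParseError on unclosed tags or HTML-only entities such as &nbsp;).

```python
import xml.etree.ElementTree as ET


def table_to_text(html, sep="\t"):
    """Return one line per <tr>, with the cell texts joined by `sep`.

    Hypothetical helper; assumes `html` is well-formed XML.
    """
    table = ET.fromstring(html)
    lines = []
    # iter("tr") also finds rows nested inside <thead> or <tbody>.
    for row in table.iter("tr"):
        cells = [
            # itertext() collects text from nested tags such as <b> or <a>.
            "".join(cell.itertext()).strip()
            for cell in row
            if cell.tag in ("td", "th")
        ]
        lines.append(sep.join(cells))
    return "\n".join(lines)


html = """\
<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>
"""

print(table_to_text(html))
```

Passing sep=" " instead of the default tab gives exactly the "Height: 200" layout shown in the question.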

Other tips

You should look at the standard library modules xml.etree.ElementTree and xml.dom.minidom.
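
As a rough illustration of the minidom route, the sketch below walks the rows of a table and joins the cell texts with tabs. The function names cell_text and rows_as_text are invented for this example, and the code assumes well-formed markup, since minidom is an XML parser rather than an HTML parser.

```python
from xml.dom import minidom


def cell_text(node):
    """Concatenate all text found below `node`, including nested tags."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(cell_text(child))
    return "".join(parts)


def rows_as_text(html, sep="\t"):
    """Return one line per <tr>; hypothetical helper, assumes valid XML."""
    doc = minidom.parseString(html)
    lines = []
    for row in doc.getElementsByTagName("tr"):
        cells = [
            cell_text(cell).strip()
            for cell in row.childNodes
            # Skip the whitespace text nodes that sit between the cells.
            if cell.nodeType == cell.ELEMENT_NODE
            and cell.tagName in ("td", "th")
        ]
        lines.append(sep.join(cells))
    return "\n".join(lines)


sample = (
    "<table>"
    "<tr><td>Height:</td><td>200</td></tr>"
    "<tr><td>Width:</td><td>440</td></tr>"
    "</table>"
)
print(rows_as_text(sample))
```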

You can use the HTQL module from http://htql.net.

Here is the sample code for your page:

import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead
import htql

url = 'http://pastebin.com/yRQvz2Ww'
page = urllib2.urlopen(url).read()

query="""<div (ID='super_frame')>1.<div (ID='monster_frame')>1.<div (ID='content_frame')>1.<div (ID='content_left')>1.<div (ID='code_frame2')>1.<div (ID='code_frame')>1.<div (ID='selectable')>1.<div (CLASS='html4strict')>1 &tx
<table>.<tr>{
    c1=<td>:colspan;   t1=<td>1 &tx; 
    c2=<td>2:colspan;   t2=<td>2 &tx;
    c3=<td>3:colspan;   t3=<td>3 &tx; 
    c4=<td>4:colspan;   t4=<td>4 &tx;
    c5=<td>5:colspan;   t5=<td>5 &tx;
}
"""

for t in htql.query(page, query):
    print('\t'.join(t))

htql.query() produces 10 columns: c1, t1, c2, t2, ..., c5, t5. You can use the colspan values in c1..c5 to work out which cells the t1..t5 text belongs in.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow