Extracting a table from web with different hierarchical tags

Question

First of all the \u2122 is only the ASCII-friendly representation of the ™ unicode character. If you print() the string, you'll see that character instead of that. So no worries!

then, your code does not work for me:

tree.xpath('//*[@id="form1"]/div[2]/div/div/section[@class="chart"]/table/tbody/tr')

is returning a list, which makes it impossible to do:

rows.xpath('//td/a/text()')

so I don't get how you're getting a result. And even if it was working, there's something you don't get with XPath, it's that // makes the search start at the root of the document, which why you're getting every content of a a tag within a td tag, not the one inside the tr you're in.

On the contrary, if you use a relative xpath, the following would work:

>>> rows[0].xpath('td/a')
[<Element a at 0x2e3ff50>, <Element a at 0x2e3ff00>]
>>> rows[0].xpath('td/a/text()')
['AatII', u'CutSmart\u2122 Buffer']

But thing is that doing so is too generic and you won't be able to keep element in the order of interest. And sadly, there's no automatic way to get that interesting stuff out of it.

Then You need to take the HTML, and see that you want the alt of the image in that td, that you want to take the content of the span in that other one:

<tr>
    <td>
        <a href="/products/r0117-aatii">AatII</a>
    </td>
    <td>
        <img class="product-icon" longdesc="This enzyme is purified from a recombinant source." alt="recombinant" src="/~/media/Icons/icon_recomb.gif">
        <img class="product-icon" longdesc="This enzyme is capable of digesting 1 µg of DNA in 5 minutes." alt="timesaver 5min" src="/~/media/Icons/icon_timesaver5.gif">
        <img class="product-icon" longdesc="Cleavage with this restriction enzyme is blocked when the substrate DNA is methylated by CpG methylase." alt="cpg" src="/~/media/Icons/icon_cpg.gif">
    </td>
    <td>GACGT/C</td>
    <td>
        <a href="/products/b7204-cutsmart-buffer">CutSmart™ Buffer</a>
    </td>
    <td>10</td>
    <td>50*</td>
    <td>50</td>
    <td>100</td>
    <td>
        <span style="color:red;">80°C</span>
    </td>
    <td>37°C</td>
    <td>B </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Blocked" src="/~/media/Icons/Blocked.gif">
    </td>
    <td>λ DNA</td>
    <td></td>
</tr>

The following is getting the values of interest from the document you linked:

>>> for row in rows: print row[0].xpath('a/text()'), [img.attrib['alt'] for img in row[1].xpath('img')], row[2].text, row[3].xpath('a/text()'), row[4].text, row[5].text, row[6].text, row[7].text, row[8].xpath('span/text()'), row[9].text, [img.attrib['alt'] for img in row[10].xpath('img')], [img.attrib['alt'] for img in row[11].xpath('img')], [img.attrib['alt'] for img in row[12].xpath('img')], row[13].text, row[14].text
['AatII'] ['recombinant', 'timesaver 5min', 'cpg'] GACGT/C [u'CutSmart\u2122 Buffer'] 10 50* 50 100 [u'80\xb0C'] 37°C [] ['Not Sensitive'] ['Not Sensitive'] None λ DNA
['AbaSI'] ['recombinant'] None ['NEBuffer 4'] 25 50 50 100 [] 25°C [] ['Not Sensitive'] ['Not Sensitive'] None None
['Acc65I'] ['recombinant', 'timesaver 5min', 'dcm', 'cpg'] G/GTACC ['NEBuffer 3.1'] 10 75* 100 25 [] 37°C [] ['Not Sensitive'] ['Blocked by Some Combinations of Overlapping'] None pBC4 DNA
...

which gets all the fields.

In the end, to make it easily reuseable, here's what I'd do:

 enzimes = [{ 'enzime'                     : row[0].xpath('a/text()'),
              'attributes'                 : [img.attrib['alt'] for img in row[1].xpath('img')],
              'Supplied NEBuffer'          : row[2].text,
              '% Activity in NEBuffer 1.1' : row[3].xpath('a/text()'),
              '% Activity in NEBuffer 2.1' : row[4].text,
              '% Activity in NEBuffer 3.1' : row[5].text,
              'CutSmart'                   : row[6].text,
              'Heat Inac.'                 : row[7].text,
              'Incu. Temp.'                : row[8].xpath('span/text()')[0] if len(row[8].xpath('span/text()')) > 0 else row[8].text,
              'Diluent'                    : row[9].text,
              'Dam'                        : [img.attrib['alt'] for img in row[10].xpath('img')],
              'Dcm'                        : [img.attrib['alt'] for img in row[11].xpath('img')],
              'CpG'                        : [img.attrib['alt'] for img in row[12].xpath('img')],
              'Unit Substrate'             : row[13].text,
              'Note'                       : row[14].text
            } for row in rows]

and for the first enzime, here's the result:

>>> import pprint
>>> pprint.pprint(enzimes[0])
{'% Activity in NEBuffer 1.1': [u'CutSmart\u2122 Buffer'],
 '% Activity in NEBuffer 2.1': '10',
 '% Activity in NEBuffer 3.1': '50*',
 'CpG': ['Not Sensitive'],
 'CutSmart': '50',
 'Dam': [],
 'Dcm': ['Not Sensitive'],
 'Diluent': u'37\xb0C',
 'Heat Inac.': '100',
 'Incu. Temp.': u'80\xb0C',
 'Note': u'\u03bb DNA',
 'Supplied NEBuffer': 'GACGT/C',
 'Unit Substrate': None,
 'attributes': ['recombinant', 'timesaver 5min', 'cpg'],
 'enzime': ['AatII']}

HTH