Question

I tried to parse the table at https://www.neb.com/tools-and-resources/usage-guidelines/nebuffer-performance-chart-with-restriction-enzymes with Python's lxml library, but when I adapt code snippets from similar questions (How to extract tables from websites in Python) I run into problems with the <a> tags and the images that appear in this table. In the end I want a text file with the following columns of this restriction enzyme table from NEB, without any formatting, just plain text:

Enzyme | Sequence | NEBuffer | % Activity in NEBuffer | Heat Inac. | Incu. Temp.

I wanted to extract each td of a row on its own and combine the information into one list entry:

from urllib2 import urlopen
from lxml import etree
url = "https://www.neb.com/tools-and-resources/usage-guidelines/nebuffer-performance-chart-with-restriction-enzymes"
tree = etree.HTML(urlopen(url).read())

rows = tree.xpath('//*[@id="form1"]/div[2]/div/div/section[@class="chart"]/table/tbody/tr')

cells = [[rows.xpath('//td/a/text()'), 
          rows.xpath('//td/text()')] for tr in rows]
print cells[1]

But it mixes everything up into a single entry, and I do not know how to deal with special characters such as the u' prefix and \u2122. The first lines of the output:

[['AatII', u'CutSmart\u2122 Buffer', 'AbaSI', 'NEBuffer 4', 'Acc65I', 'NEBuffer 3.1', 'AccI', u'CutSmart\u2122 Buffer', 'AciI', u'CutSmart\u2122 Buffer', 

I also don't think my code skips columns such as the images in column 2 :/

I hope my question is detailed enough so you are able to understand what I am trying to do.

Solution

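First of all, \u2122 is only the escaped, ASCII-safe representation of the Unicode character ™ (the trademark sign); Python 2 shows it that way whenever a unicode string is displayed inside a list. If you print the string itself, you'll see the character instead of the escape. So no worries!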
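
A quick way to convince yourself, sketched with a literal string rather than the scraped page (the variable names are mine, and the snippet is written so that it should also run on Python 3):

```python
name = u'CutSmart\u2122 Buffer'   # same value as in the printed list

# The escaped form is only how the string is displayed inside a list.
escaped = name.encode('unicode_escape').decode('ascii')
print(escaped)          # CutSmart\u2122 Buffer

# The string itself holds a single trademark character at that position.
print(len(name))        # 16

# Encode explicitly when the text has to leave Python (file, pipe, ...).
utf8_bytes = name.encode('utf-8')
print(len(utf8_bytes))  # 18, because the trademark sign takes 3 bytes
```

So the data is fine; you only need to pick an encoding (UTF-8 is the usual choice) when you write the text file.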

Next, your code does not work for me:

tree.xpath('//*[@id="form1"]/div[2]/div/div/section[@class="chart"]/table/tbody/tr')

returns a list, and a list has no xpath method, which makes it impossible to do:

rows.xpath('//td/a/text()')

so I don't see how you're getting a result at all. And even if it did work, there's something about XPath you've missed: a leading // makes the search start at the root of the document, which is why you get the content of every a tag within a td tag in the whole document, not just the ones inside the tr you're in.

If you use a relative XPath instead, the following works:

>>> rows[0].xpath('td/a')
[<Element a at 0x2e3ff50>, <Element a at 0x2e3ff00>]
>>> rows[0].xpath('td/a/text()')
['AatII', u'CutSmart\u2122 Buffer']
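
To see the difference side by side, here is a small sketch on a made-up two-row table (the markup is a stand-in I wrote for illustration, not the real NEB page):

```python
from lxml import etree

# Two-row stand-in for the NEB table (hypothetical markup).
html = """
<table>
  <tr><td><a href="/a">AatII</a></td><td>GACGT/C</td></tr>
  <tr><td><a href="/b">AbaSI</a></td><td></td></tr>
</table>
"""
tree = etree.HTML(html)
rows = tree.xpath('//tr')

# '//' restarts at the document root, so every row returns every link.
absolute = [tr.xpath('//td/a/text()') for tr in rows]

# A relative path only looks below the current <tr>.
relative = [tr.xpath('td/a/text()') for tr in rows]

print(absolute)  # both rows list both enzymes
print(relative)  # each row lists only its own enzyme
```

If the cells were nested deeper than one level, './/td/a/text()' (note the leading dot) would search all descendants of the current row without jumping back to the root.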

But the thing is, that approach is too generic: it only picks up the cells that contain a link, so you can't keep the values in the column order you care about. And sadly, there's no automatic way to pull the interesting values out.

So you need to look at the HTML of a row and decide, cell by cell, what to extract: the alt of the images in one td, the content of the span in another, and so on:

<tr>
    <td>
        <a href="/products/r0117-aatii">AatII</a>
    </td>
    <td>
        <img class="product-icon" longdesc="This enzyme is purified from a recombinant source." alt="recombinant" src="/~/media/Icons/icon_recomb.gif">
        <img class="product-icon" longdesc="This enzyme is capable of digesting 1 µg of DNA in 5 minutes." alt="timesaver 5min" src="/~/media/Icons/icon_timesaver5.gif">
        <img class="product-icon" longdesc="Cleavage with this restriction enzyme is blocked when the substrate DNA is methylated by CpG methylase." alt="cpg" src="/~/media/Icons/icon_cpg.gif">
    </td>
    <td>GACGT/C</td>
    <td>
        <a href="/products/b7204-cutsmart-buffer">CutSmart™ Buffer</a>
    </td>
    <td>10</td>
    <td>50*</td>
    <td>50</td>
    <td>100</td>
    <td>
        <span style="color:red;">80°C</span>
    </td>
    <td>37°C</td>
    <td>B </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Blocked" src="/~/media/Icons/Blocked.gif">
    </td>
    <td>λ DNA</td>
    <td></td>
</tr>
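
If you'd rather not write one accessor per column, one possible shortcut is a small helper that returns whatever text a cell contains and falls back to the image alt texts. This is only a sketch: cell_value is a name I made up, and the markup below is a shortened copy of the row above.

```python
from lxml import etree

def cell_value(td):
    """Return the visible text of a <td>, or its image alt texts if it has none."""
    # itertext() walks nested tags such as <a> and <span> as well.
    text = ' '.join(''.join(td.itertext()).split())
    if text:
        return text
    return ', '.join(td.xpath('img/@alt'))

# Shortened, hypothetical version of the row shown above.
html = """
<table><tr>
  <td><a href="/products/r0117-aatii">AatII</a></td>
  <td><img alt="recombinant" src="r.gif"><img alt="cpg" src="c.gif"></td>
  <td>GACGT/C</td>
  <td><span style="color:red;">80&#176;C</span></td>
  <td></td>
</tr></table>
"""
row = etree.HTML(html).xpath('//tr')[0]
values = [cell_value(td) for td in row]   # iterating a <tr> yields its <td>s
print(values)
```

The trade-off is that you lose the distinction between a text cell and an icon cell, which may or may not matter for a plain-text export.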

The following gets the values of interest from the document you linked:

>>> for row in rows: print row[0].xpath('a/text()'), [img.attrib['alt'] for img in row[1].xpath('img')], row[2].text, row[3].xpath('a/text()'), row[4].text, row[5].text, row[6].text, row[7].text, row[8].xpath('span/text()'), row[9].text, [img.attrib['alt'] for img in row[10].xpath('img')], [img.attrib['alt'] for img in row[11].xpath('img')], [img.attrib['alt'] for img in row[12].xpath('img')], row[13].text, row[14].text
['AatII'] ['recombinant', 'timesaver 5min', 'cpg'] GACGT/C [u'CutSmart\u2122 Buffer'] 10 50* 50 100 [u'80\xb0C'] 37°C [] ['Not Sensitive'] ['Not Sensitive'] None λ DNA
['AbaSI'] ['recombinant'] None ['NEBuffer 4'] 25 50 50 100 [] 25°C [] ['Not Sensitive'] ['Not Sensitive'] None None
['Acc65I'] ['recombinant', 'timesaver 5min', 'dcm', 'cpg'] G/GTACC ['NEBuffer 3.1'] 10 75* 100 25 [] 37°C [] ['Not Sensitive'] ['Blocked by Some Combinations of Overlapping'] None pBC4 DNA
...

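which gets nearly all the fields. Compare the output with the HTML above, though: row[10] is the text-only diluent cell and the three methylation icons sit in row[11] to row[13], so the last few accessors in this one-liner are off by one column, hence the empty list and the None in the output.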

In the end, to make it easily reusable, here's what I'd do:

 enzymes = [{ 'enzyme'                     : row[0].xpath('a/text()'),
              'attributes'                 : [img.attrib['alt'] for img in row[1].xpath('img')],
              'Sequence'                   : row[2].text,
              'Supplied NEBuffer'          : row[3].xpath('a/text()'),
              '% Activity in NEBuffer 1.1' : row[4].text,
              '% Activity in NEBuffer 2.1' : row[5].text,
              '% Activity in NEBuffer 3.1' : row[6].text,
              'CutSmart'                   : row[7].text,
              'Heat Inac.'                 : row[8].xpath('span/text()')[0] if len(row[8].xpath('span/text()')) > 0 else row[8].text,
              'Incu. Temp.'                : row[9].text,
              'Diluent'                    : row[10].text,
              'Dam'                        : [img.attrib['alt'] for img in row[11].xpath('img')],
              'Dcm'                        : [img.attrib['alt'] for img in row[12].xpath('img')],
              'CpG'                        : [img.attrib['alt'] for img in row[13].xpath('img')],
              'Unit Substrate'             : row[14].text,
              'Note'                       : row[15].text
            } for row in rows]

and for the first enzyme, here's the result:

>>> import pprint
>>> pprint.pprint(enzymes[0])
{'% Activity in NEBuffer 1.1': '10',
 '% Activity in NEBuffer 2.1': '50*',
 '% Activity in NEBuffer 3.1': '50',
 'CpG': ['Blocked'],
 'CutSmart': '100',
 'Dam': ['Not Sensitive'],
 'Dcm': ['Not Sensitive'],
 'Diluent': 'B ',
 'Heat Inac.': u'80\xb0C',
 'Incu. Temp.': u'37\xb0C',
 'Note': None,
 'Sequence': 'GACGT/C',
 'Supplied NEBuffer': [u'CutSmart\u2122 Buffer'],
 'Unit Substrate': u'\u03bb DNA',
 'attributes': ['recombinant', 'timesaver 5min', 'cpg'],
 'enzyme': ['AatII']}
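
Since you ultimately want a plain text file, here is a rough sketch of that last step. The to_line helper, the column names and the two sample records are all made up for illustration; swap in the keys of your own dicts. The output path is arbitrary too.

```python
import io
import os
import tempfile

def to_line(record, columns, sep=u' | '):
    """Join the chosen columns of one record into a single plain-text line."""
    parts = []
    for key in columns:
        value = record.get(key)
        if isinstance(value, (list, tuple)):
            value = u', '.join(value)      # e.g. several icon alt texts
        parts.append(u'' if value is None else value.strip())
    return sep.join(parts)

# Hypothetical records; the keys need to match whatever your dicts use.
columns = ['Enzyme', 'Sequence', 'Heat Inac.']
records = [
    {'Enzyme': ['AatII'], 'Sequence': 'GACGT/C', 'Heat Inac.': u'80\xb0C'},
    {'Enzyme': ['AbaSI'], 'Sequence': None, 'Heat Inac.': None},
]
lines = [to_line(record, columns) for record in records]

# io.open accepts an encoding on both Python 2 and Python 3.
path = os.path.join(tempfile.gettempdir(), 'enzymes.txt')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u' | '.join(columns) + u'\n')
    for line in lines:
        handle.write(line + u'\n')
```

Empty cells come out as empty fields, so every line keeps the same number of separators and stays easy to split again later.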

HTH

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow