How to scrape a table from a webpage?

https://datascience.stackexchange.com/questions/10857

16-10-2019

Question

I need to scrape a table from a webpage and load it into a pandas DataFrame, but I have not been able to do it. First, here is how the table is encoded in the HTML document.

<tbody>
<tr>
<th colspan="2">United States Total<strong>**</strong></th>
<td><strong>15,069.0</strong></td>
<td><strong>14,575.0</strong></td>
<td><strong>100.0</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<th colspan="7">Arizona</th>
</tr>
<tr>
<td>Pinal Energy, LLC</td>
<td>Maricopa, AZ</td>
<td>50.0</td>
<td>50.0</td>
<td>NA</td>
<td>2012-07-01</td>
<td>2014-03</td>
</tr>
<tr>
<td colspan="2"><strong>Arizona Total</strong></td>
<td>50.0</td>
<td>50.0</td>
<td>NA</td>
<td></td>
<td></td>
</tr>
<tr>

The body of the table is enclosed in <tbody>....</tbody>. Each <tr>....</tr> pair is one row of the table. Within each row, every cell is given by a <td> element, for example <td>50.0</td>.

Here are my questions:

1) How do I scrape it? I am using BeautifulSoup and requests for this, along with the pandas module. I tried the following:

r = requests.get(url)
bs = BeautifulSoup(r.text)
info = bs.findALL('tr','td')
  ....
  ....

But it gives me this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-32d9483e2c59> in <module>()
      1 bs = BeautifulSoup(r.text)
----> 2 info = bs.findALL('tr','td')
      3 #print bs

TypeError: 'NoneType' object is not callable

2) I need to skip some rows based on the text in them. For example, I don't want to read in any row in which the word 'Total' appears (as in <th colspan="2">United States Total<strong>**</strong></th>). How do I do that? It is not extremely important, since I can get rid of these rows later, but ideally I would skip them while reading the data.

I know this is a long post, but if someone can help me with it, I would greatly appreciate it. Please let me know if more information is needed.

Thanks much.

Solution

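The TypeError occurs because BeautifulSoup has no findALL method: the call is spelled find_all (or findAll in older code). BeautifulSoup treats an unknown attribute such as findALL as a search for a tag of that name, finds none, and returns None, which the code then tries to call. The following will give you the text of all <td> cells in each <tr>: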

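import requests
from bs4 import BeautifulSoup

data = requests.get(url).text  # url is the page address, as in the question
bs = BeautifulSoup(data, "lxml")
table_body = bs.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    # Stripped text of every <td> cell in this row
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)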
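
The second part of the question (skipping rows that contain 'Total' and ending up with a pandas DataFrame) could be handled along the following lines. This is only a sketch under some assumptions: the HTML string is a shortened copy of the excerpt from the question wrapped in a <table> tag rather than the live page, the built-in "html.parser" is used in place of lxml so no extra package is needed, and the names html, records and df are illustrative. The real column headers are not shown in the question, so the DataFrame columns are left as default integer labels.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the table excerpt from the question, wrapped in <table>
html = """
<table><tbody>
<tr><th colspan="2">United States Total<strong>**</strong></th>
<td><strong>15,069.0</strong></td><td><strong>14,575.0</strong></td>
<td><strong>100.0</strong></td><td></td><td></td></tr>
<tr><th colspan="7">Arizona</th></tr>
<tr><td>Pinal Energy, LLC</td><td>Maricopa, AZ</td><td>50.0</td>
<td>50.0</td><td>NA</td><td>2012-07-01</td><td>2014-03</td></tr>
<tr><td colspan="2"><strong>Arizona Total</strong></td>
<td>50.0</td><td>50.0</td><td>NA</td><td></td><td></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser instead of lxml

records = []
for row in soup.find("tbody").find_all("tr"):
    # get_text() covers <th> and <td> alike, so "United States Total" is caught too
    if "Total" in row.get_text():
        continue
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # rows holding only a <th> (state headings) have no <td> cells
        records.append(cells)

df = pd.DataFrame(records)  # columns keep default integer labels
print(df)
```

Checking the text of the whole row, rather than only its <td> cells, matters here because the 'United States Total' label sits in a <th>. State heading rows such as 'Arizona' contain no <td> cells and are dropped by the emptiness check; if the state name is needed as a column, it would have to be carried forward from those rows separately.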

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange