Question

I am trying to scrape a data table from http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States

I used the following code:

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table",{ "class" : "wikitable" })

for row in table.findAll('tr')[1:]:
col = row.findAll('th')
Vehicle = col[0].string
Year1 = col[2].string
Year2 = col[3].string
Year3 = col[4].string
Year4 = col[5].string
Year5 = col[6].string
Year6 = col[7].string
Year7 = col[8].string
Year8 = col[9].string
Year9 = col[10].string
Year10 = col[11].string
Year11 = col[12].string
Year12 = col[13].string
Year13 = col[14].string
Year14 = col[15].string
Year15 = col[16].string
Year16 = col[17].string
record =(Vehicle,Year1,Year2,Year3,Year4,Year5,Year6,Year7,Year8,Year9,Year10,Year11,Year12,Year13,Year14,Year15,Year16)
print "|".join(record)

I get this error:

 File "scrap1.ph", line 13
    col = row.findAll('th')
      ^
IndentationError: expected an indented block

Can anybody tell me what I am doing wrong?

Solution

Besides @traceur's point about the indentation error, here's how you can simplify the code dramatically:

from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()
url = "http://en.wikipedia.org/wiki/Hybrid_electric_vehicles_in_the_United_States"
soup = BeautifulSoup(mech.open(url))
table = soup.find("table", class_="wikitable")

for row in table('tr')[1:]:
    print "|".join(col.text.strip() for col in row.find_all('th'))

Note that you should use from bs4 import BeautifulSoup (BeautifulSoup 4) rather than from BeautifulSoup import BeautifulSoup (BeautifulSoup 3), since version 3 is no longer maintained.

Also note that you can pass mech.open(url) directly to the BeautifulSoup constructor instead of manually reading it.
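To make the indentation point concrete, here is a small sketch of why the original loop fails and what the corrected shape looks like. It is written for Python 3 (the code in this post uses the Python 2 print statement) and uses only the standard library. The rows list and the format_rows helper are hypothetical stand-ins for the cells you would get from each table row, not real data from the Wikipedia page.

```python
# Hypothetical sample rows standing in for the cell text of each <tr>;
# the values are placeholders, not data from the Wikipedia table.
rows = [
    ["Vehicle A", " 10 ", "20"],
    ["Vehicle B", "30", " 40 "],
]

# A loop whose body is not indented, as in the question.
broken = "for row in rows:\ncol = row\n"
try:
    compile(broken, "<sketch>", "exec")
    compiled = True
except IndentationError:
    # Python rejects the source before running any of it.
    compiled = False


def format_rows(rows):
    """Return one pipe-delimited string per row, with whitespace stripped."""
    lines = []
    for row in rows:
        # Every statement that belongs to the loop is indented under `for`.
        cells = [cell.strip() for cell in row]
        lines.append("|".join(cells))
    return lines


print("broken snippet compiled:", compiled)
for line in format_rows(rows):
    print(line)
```

In the real script, the same rule applies: every line from col = row.findAll('th') down to the print call has to be indented one level under the for statement.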

Hope that helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow