Parsing HTML rows into CSV

https://stackoverflow.com/questions/1086266

23-08-2019
|

Question

First off the html row looks like this:

<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>

I would show the real html but I am sorry to say don't know how to block it. feels shame

Using BeautifulSoup (Python) or any other recommended Screen Scraping/Parsing method I would like to output about 1200 .htm files in the same directory into a CSV format. This will eventually go into an SQL database. Each directory represents a year and I plan to do at least 5 years.

I have been goofing around with glob as the best way to do this from some advice. This is what I have so far and am stuck.

import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
#these files go from pl020001.htm to pl021230.htm sequentially
    soup = BeautifulSoup(open(filename["r"]))
    for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

I realize this is ugly but it's my first time attempting anything like this. This one problem has taken me months to get to this point after realizing that I don't have to manually go through thousands of files copy and pasting into excel. I have also realized that I can kick my computer repeatedly out of frustration and it still works (not recommended). I am getting close and I need to know what to do next to make those CSV files. Please help or my monitor finally gets hammer punched.

Solution

You need to import the csv module by adding import csv to the top of your file.

Then you'll need something to create a csv file outside your loop of the rows, like so:

writer = csv.writer(open("%s.csv" % filename, "wb"))

Then you need to actually pull the data out of the html row in your loop, similar to

values = (td.fetchText() for td in row)
writer.writerow(values)

OTHER TIPS

You don't really explain why you are stuck - what's not working exactly?

The following line may well be your problem:

soup = BeautifulSoup(open(filename["r"]))

It looks to me like this should be:

soup = BeautifulSoup(open(filename, "r"))

The following line:

for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

looks like it will only pick out even rows (assuming your even rows have the class 'evenColor' and odd rows have 'oddColor'). Assuming you want all rows with a class of either evenColor or oddColor, you can use a regular expression to match the class value:

for row in soup.findAll("tr", attrs={ "class" : re.compile(r"evenColor|oddColor") })

That looks fine, and BeautifulSoup is useful for this (although I personally tend to use lxml). You should be able to take that data you get, and make a csv file out of is using the csv module without any obvious problems...

I think you need to actually tell us what the problem is. "It still doesn't work" is not a problem descripton.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow