Question

I am trying to scrape rows from over 1,200 .htm files on my hard drive. On my computer they live at 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM', and they are numbered sequentially from *20001.HTM to *21230.HTM. My plan is to eventually get the data into MySQL or SQLite, either via a spreadsheet app or straight in if I can get a clean .csv file out of this process.

This is my first attempt at code (Python) and at scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I'm a newb and have hit some roadblocks.

How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style URL, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or just send all 1,230 at once?

I just need the rows that start with "<tr class="evenColor">" and end with "</tr>". Ideally I only want the rows that contain "SHOT", "MISS", or "GOAL", but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify that? There are 3 tables per .htm file.

Also, I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape so I can identify them in their own column in the final database. I don't even know where to begin for that. This is what I have so far:

#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)
##this confuses me and seems redundant
pl = open("input_file.html", "r")
chances = open("chancesforsql.csv", "w")

table = soup.find("table", border=0)
for row in table.findAll("tr", attrs={"class": "evenColor"}):
    #should I do this instead of before?
    outfile = open("shooting.csv", "w")

##how do I end it?

Should I be using IDLE or something like it, or just the terminal in Ubuntu 9.04?


Solution

You won't need mechanize. Since I don't know the exact HTML content, I'd first check what the selector matches. Like this:

import glob
from BeautifulSoup import BeautifulSoup

# sorted() processes the files in sequential filename order
for filename in sorted(glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM')):
    soup = BeautifulSoup(open(filename, "r").read()) # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        print a_tr

Then pick out the fields you want, print them to stdout separated by commas (and redirect stdout with > to a file), or write the CSV directly from Python with the csv module.
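If you'd rather go straight to CSV, here's a minimal sketch of the whole pipeline (untested; the output filename and the assumption that SHOT/MISS/GOAL appear somewhere in the row text are mine, not from the question). Flattening each cell with findAll(text=True) also pulls text out of nested tags, so the bolded GOAL needs no special handling, and prepending the file name gives each row its identifying column:

import csv
import glob
import os
from BeautifulSoup import BeautifulSoup

out = open('chancesforsql.csv', 'wb')  # output name is just an example
writer = csv.writer(out)

for filename in sorted(glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM')):
    soup = BeautifulSoup(open(filename, 'r').read())
    for a_tr in soup.findAll('tr', attrs={'class': 'evenColor'}):
        # flatten each cell to plain text; text inside nested tags
        # (e.g. a bolded GOAL) is included automatically
        cells = [''.join(td.findAll(text=True)).strip()
                 for td in a_tr.findAll('td')]
        row_text = ' '.join(cells)
        if 'SHOT' in row_text or 'MISS' in row_text or 'GOAL' in row_text:
            # first column identifies the parent file
            writer.writerow([os.path.basename(filename)] +
                            [c.encode('utf-8') for c in cells])

out.close()

The resulting file can then be imported into SQLite or MySQL directly.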

OTHER TIPS

MYYN's answer looks like a great start to me. One thing I've had luck with is this pattern:

import glob

for file_name in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
    # read the file and then parse with BeautifulSoup

I've found both the os and glob modules to be really useful for running through the files in a directory.

Also, once you're using a for loop in this way, you have file_name in hand, which you can modify to build the output filename, so that the output filenames match the input filenames (see the sketch below).
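For example, a hypothetical naming scheme using os.path (the .csv extension is just an illustration):

import glob
import os

for file_name in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
    # 'PL020001.HTM' -> 'PL020001.csv'
    base = os.path.splitext(os.path.basename(file_name))[0]
    out_name = base + '.csv'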

Licensed under: CC-BY-SA with attribution