Question

I'm scraping and saving (as a comma-delimited text file) information on roll call votes in the US House of Representatives.

Each line in the resulting file takes the following form:

Roll Call Number, Bill, Date, Representative, Vote, Total Yeas, Total Nays

Where I'm running into trouble is scraping the dates from 1-Nov-2001 (roll call 414) onward. Instead of matching 1-Nov-2001, the regex matches incorrectly or breaks. In the first case, it matches the string '-AND-'. The text does change between #414 and #415 to include the string 'YEAS-AND-NAYS'.

I'm betting I've written the regex wrong, but I'm not seeing it. What might I need to change to match the date instead? The relevant code is below.

import urllib2, datetime, sys, re, string
import xml.etree.ElementTree as ET

for i in range(414,514):
    if i < 10:
        num_string = "00"+str(i)
    elif i < 100:
        num_string = "0"+str(i)
    elif i > 100:
        num_string = str(i)
    print num_string, datetime.datetime.now()
    url = "http://clerk.house.gov/evs/2001/roll"+num_string+".xml"
    text = urllib2.urlopen(url).read()
    tree = ET.fromstring(text)
    notags = ET.tostring(tree, encoding="utf8", method="text")
    dte = re.search(r'[0-9]*-[A-Za-z]*-[0-9]*', notags).group()
    print dte

Solution

Using a regular expression against an XML document is never a good idea (seriously).

You can achieve the desired result without any regular expressions by extracting the date from the relevant XML element (I've used lxml.etree instead of xml.etree.ElementTree, but the principle will be the same).

Also, I've added an easier way to generate a zero-padded 3-digit number. It also closes a gap in your if/elif chain, where elif i > 100 leaves i == 100 unhandled.

import urllib2, datetime
import lxml.etree

for i in range(414,416):
    # Zero-pad the roll call number to three digits, e.g. 7 -> "007"
    num_string = '{:03d}'.format(i)
    print num_string, datetime.datetime.now()
    url = "http://clerk.house.gov/evs/2001/roll"+num_string+".xml"
    # Parse the XML straight from the HTTP response
    xml = lxml.etree.parse(urllib2.urlopen(url))
    root = xml.getroot()
    # Take the first <action-date> element anywhere in the document
    actdate = root.xpath('//action-date')[0]
    dte = actdate.text.strip()
    print dte
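
For anyone who would rather stay with the standard library, here is a rough sketch of the same idea using xml.etree.ElementTree, written for Python 3 and run against an inline sample so it needs no network access. The sample XML is a made-up, heavily trimmed stand-in for a real roll call file (only the action-date element name comes from the code above), and roll_url and extract_action_date are hypothetical helper names, not part of any library.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical, trimmed stand-in for one roll call document.
# Only the <action-date> element name is taken from the answer above;
# the surrounding structure is an assumption for illustration.
SAMPLE_XML = """<rollcall-vote>
  <vote-metadata>
    <vote-type>YEAS-AND-NAYS</vote-type>
    <action-date> 1-Nov-2001 </action-date>
  </vote-metadata>
</rollcall-vote>"""


def roll_url(year, number):
    """Build the roll call URL with a zero-padded 3-digit number."""
    return "http://clerk.house.gov/evs/{}/roll{:03d}.xml".format(year, number)


def extract_action_date(xml_text):
    """Return the stripped text of the first <action-date>, or None."""
    root = ET.fromstring(xml_text)
    node = root.find(".//action-date")
    if node is None or node.text is None:
        return None
    return node.text.strip()


date_text = extract_action_date(SAMPLE_XML)
# %b expects an abbreviated month name in the current locale (English here).
parsed = datetime.strptime(date_text, "%d-%b-%Y").date()

print(roll_url(2001, 414))
print(date_text, parsed.isoformat())
```

Because the lookup targets the element by name, other text in the document (such as 'YEAS-AND-NAYS') cannot be mistaken for the date. Parsing the string with strptime is optional, but it gives you a real date object to sort or compare on.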

If you insist on using a regular expression, then [0-9]+-[A-Za-z]+-[0-9]+ would be better. Your pattern uses *, which also matches zero characters, so a bare -AND- (from 'YEAS-AND-NAYS') satisfies it before the date is ever reached. With + the match requires at least one digit, a dash, at least one letter, a dash, and at least one digit (as holdenweb mentions in his comment).
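
To see the difference between the two quantifiers, here is a small Python 3 sketch run against made-up sample strings (not real roll call text). The third pattern, STRICT, is my own tighter suggestion and assumes dates always look like one or two digits, a three-letter month, and a four-digit year.

```python
import re

# Made-up sample text; real roll call output will differ.
SAMPLE_TEXT = "QUORUM YEAS-AND-NAYS 1-Nov-2001"
OTHER_TEXT = "107-HR-3 1-Nov-2001"

# Original pattern: every part may be empty, so "-AND-" qualifies.
LOOSE = re.compile(r"[0-9]*-[A-Za-z]*-[0-9]*")
# Suggested fix: each part needs at least one character.
PLUS = re.compile(r"[0-9]+-[A-Za-z]+-[0-9]+")
# Assumed date shape: 1-2 digit day, 3-letter month, 4-digit year.
STRICT = re.compile(r"\b[0-9]{1,2}-[A-Za-z]{3}-[0-9]{4}\b")

loose_hit = LOOSE.search(SAMPLE_TEXT).group()
plus_hit = PLUS.search(SAMPLE_TEXT).group()
strict_hit = STRICT.search(SAMPLE_TEXT).group()

print(loose_hit)   # -AND-
print(plus_hit)    # 1-Nov-2001
print(strict_hit)  # 1-Nov-2001

# The + version can still be fooled by other digit-letter-digit strings.
print(PLUS.search(OTHER_TEXT).group())    # 107-HR-3
print(STRICT.search(OTHER_TEXT).group())  # 1-Nov-2001
```

Even the stricter pattern is only a heuristic over flattened text, which is why reading the date from its own XML element remains the more robust route.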

Licensed under: CC-BY-SA with attribution