Question

I'm parsing an HTML table using BeautifulSoup 4 in Python. Everything works fine: I'm able to identify all the elements that I need and print them. But the program stops working when I try to write the results into a text file. I get this error:

"UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 13: ordinal not in range(128)"

I have tried using .encode('utf-8') in the write call, but then I get something like this written: 31.61 
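(For context, u'\xa0' is a non-breaking space. A minimal sketch of the behaviour, with an assumed cell value rather than the real scraped data: in Python 2, writing a unicode string to a text-mode file forces an implicit ASCII encode, which raises exactly this UnicodeEncodeError; an explicit UTF-8 encode succeeds, but turns U+00A0 into the byte pair 0xC2 0xA0, which a Latin-1/Windows-1252 viewer displays as a stray accented character after the number.)

```python
# -*- coding: utf-8 -*-
# Sketch with an assumed cell value: the scraped text ends in a
# non-breaking space, U+00A0.
cell = u'31.61\xa0'

# Explicit UTF-8 encoding avoids the UnicodeEncodeError, but U+00A0
# becomes the two bytes 0xC2 0xA0 in the output file.
encoded = cell.encode('utf-8')
```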

Here's what I'm running. I used the same code structure to parse another table and it worked. I'd appreciate it if anyone can point me in the right direction.

from threading import Thread
import urllib2
import re
from bs4 import BeautifulSoup


url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan" 
myfile = open('base/basei/' + url[57:].replace("%20", " ").replace("%27","'") + '.txt','w+')
soup = BeautifulSoup(urllib2.urlopen(url).read())  
for tr in soup.find_all('tr')[0:]:
  tds = tr.find_all('td')
  if len(tds) >=0:
    print tds[0].text, ",", tds[4].text, ",", tds[7].text, ",", tds[12].text, ",", tds[14].text, ",", tds[17].text
    myfile.write(tds[0].text + ','+ tds[4].text + "," + tds[7].text + "," + tds[12].text + "," + tds[14].text + "," + tds[17].text)

myfile.close() 

Solution

The code below works for me. I replaced the non-breaking space with a comma; this way you can use the output directly as CSV (e.g. you can easily read it into Excel or LibreOffice Calc).

import urllib2                                                                  
from bs4 import BeautifulSoup                                                   

url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan"
soup = BeautifulSoup(urllib2.urlopen(url).read())                               

with open('out.txt', 'w') as myfile:                                           
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) >= 18:  # skip header/short rows lacking the cells used below
            stripped_tds = [tds[x].text.strip() for x in (0, 4, 7, 12, 14, 17)]
            out = ','.join(stripped_tds)
            out = out.replace(u'\xa0', ',')
            print out
            myfile.write(out + '\n')

(The with statement removes the need to call myfile.close() explicitly: the file is closed automatically when the block inside the with completes, even if an exception is raised there.)
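The equivalent explicit form (a sketch for illustration, not part of the original answer) looks like this:

```python
# Roughly what the with statement does under the hood: the finally clause
# guarantees the file is closed even if the body raises an exception.
myfile = open('out.txt', 'w')
try:
    myfile.write('2014-04-15,E5,31.28,7,6,32.18,C\n')
finally:
    myfile.close()
```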

Content of out.txt:

2014-04-15,E5,31.28,7,6,32.18,C
2014-04-13,E6,31.07,2,4,31.64,B
2014-04-11,E6,31.21,6,6,32.53,B
2014-04-07,E7,30.93,5,7,32.31,B
2014-04-03,S1,30.82,3,2,31.23,
2014-03-30,E9,31.02,3,8,31.97,A
2014-03-28,E9,30.95,7,8,31.85,A
2014-03-23,E9,30.88,8,8,32.06,A
2014-03-21,E6,30.83,1,1,30.83,SB
2014-03-17,E5,31.14,1,1,31.14,C
2014-03-15,E5,31.00,4,4,31.62,C
2014-03-10,E3,31.46,4,1,31.46,D
2014-03-08,A3,31.79,4,5,32.23,D
2014-03-03,A6,31.20,3,5,31.81,D
2014-03-01,E3,31.61,3,3,31.88,D
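Since each line of out.txt is comma-separated, it can be loaded directly with Python's standard csv module. A hypothetical check (not part of the original answer), using two lines in the same format as the output above:

```python
import csv
import io

# Parse two sample rows in the same format as out.txt.
sample = u'2014-04-15,E5,31.28,7,6,32.18,C\n2014-04-13,E6,31.07,2,4,31.64,B\n'
rows = list(csv.reader(io.StringIO(sample)))
```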
Licensed under: CC-BY-SA with attribution