BeautifulSoup is a really good HTML parser, so use it to its full potential for parsing HTML. Just modify your code as follows:
names=[texts.text for texts in soup.findAll('a',{'href':re.compile("dog")})]
This takes the text between the anchor tags, so you won't need d = (str(eachname.string.split()))+"\n"
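As a small self-contained illustration of that list comprehension, here is a sketch that runs against an inline HTML fragment instead of the live page. The fragment and its href values are made up for the example, and it uses find_all, the current bs4 spelling of findAll, together with the built-in "html.parser" backend.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for illustration only.
html = """
<a href="/dog.jsp?id=1">FROSTED COOKIE</a>
<a href="/track.jsp">Some Track</a>
<a href="/dog.jsp?id=2">FREEDOM ROCK</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the anchors whose href contains "dog", and take their text.
names = [a.text for a in soup.find_all("a", {"href": re.compile("dog")})]
print(names)
```

Only the two anchors whose href contains "dog" should end up in names; the link to the track page is skipped.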
So the final code will be:
from bs4 import BeautifulSoup
import urllib2
import re
import codecs
url="http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=[texts.text for texts in soup.findAll('a',{'href':re.compile("dog")})]
myfile = codecs.open("base/dogs.txt","wb",encoding="Utf-8")
for eachname in names:
    eachname=re.sub(r"[\t\n]","",eachname)
    myfile.write(eachname+"\n")
myfile.close()
If you just need the names in the file without the u'...' prefix, use codecs.open() or io.open() to open a text file with an appropriate text encoding (i.e. encoding="...") instead of opening a byte file with open().
That would be
myfile = codecs.open("base/dogs.txt","w+",encoding="Utf-8")
in your case.
The output in the file will then be:
BARTSSHESWAYCOOL
DK'S SEND ALL
SHAKIN THINGS UP
FROSTED COOKIE
JD EMBELLISH
WW CASH N CARRY
FREEDOM ROCK
HVAC BUTCHIE
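To illustrate the point about opening the file with an encoding, here is a minimal sketch using io.open() rather than codecs.open(). The names and the temporary output path are made up for the example (they stand in for the scraped anchor text and base/dogs.txt), and the snippet is written so that it should run on both Python 2 and Python 3.

```python
import io
import os
import re
import tempfile

# Made-up names with stray tabs/newlines, standing in for scraped anchor text.
names = [u"\tFROSTED COOKIE\n", u"JD EMBELLISH\t"]

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "dogs.txt")

# io.open with an encoding accepts unicode text and encodes it on write.
with io.open(path, "w", encoding="utf-8") as myfile:
    for name in names:
        # Strip tabs and newlines, then write one name per line.
        myfile.write(re.sub(r"[\t\n]", u"", name) + u"\n")

# Read the file back to check what was actually written.
with io.open(path, encoding="utf-8") as myfile:
    lines = myfile.read().splitlines()
print(lines)
```

Using a with block also closes the file automatically, so no explicit myfile.close() is needed.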
Also see this question, which I asked about almost the same problem.