Question

Hi, I'm using BeautifulSoup to parse a website and get a name as output. But after running the script, I get [u'word1', u'word2', u'word3'] as output. What I'm looking for is 'word1 word2 word3'. How do I get rid of the u' and make the result a single string?

from bs4 import BeautifulSoup
import urllib2
import re

myfile = open("base/dogs.txt","w+")
myfile.close()

url="http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=soup.findAll('a',{'href':re.compile("dog")})
myfile = open("base/dogs.txt","w+")
for eachname in names:
    d = (str(eachname.string.split()))+"\n"
    print [x.encode('ascii') for x in d]
    myfile.write(d)

myfile.close()

Solution 2

The answers here that use .encode() give you what you asked for, but probably not what you need. You can keep the strings as Unicode and simply avoid printing them in a form that shows their type. The underlying values are still u'word1', u'word2', u'word3' -- which avoids breaking support for languages that can't be represented in ASCII -- but they are printed as word1 word2 word3.

Just do:

for eachname in names:
    d = ' '.join(eachname.string.split()) + '\n'
    print d
    myfile.write(d)
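
To see why this works, here is a minimal sketch, assuming a made-up sample string in place of eachname.string. Calling str() on the list produces the list's repr, brackets and quotes included (and the u prefix under Python 2), while join() produces one plain string. The variable names are illustrative only.

```python
# Hypothetical stand-in for eachname.string; not taken from the real page.
raw = u"word1   word2\tword3"

# split() with no arguments splits on any run of whitespace.
words = raw.split()

# str() of a list is its repr: brackets, quotes and (on Python 2) the u prefix.
as_repr = str(words)

# join() glues the items together with single spaces into one string.
joined = u" ".join(words)

print(as_repr)
print(joined)
```

The strings in the list never change; only the way they are displayed differs.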

OTHER TIPS

BeautifulSoup and Unicode, Dammit!

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("Sacré bleu!")
<html><body><p>Sacré bleu!</p></body></html>

Isn't that great? When the soup is made, the document is converted to Unicode, and HTML entities are converted to Unicode characters! So you get Unicode objects as results, as intended. Nothing wrong with that.

So your question is about Unicode. And Unicode is explained in this video. Don't like videos? Read an Introduction to Unicode.

The u is short for 'the following string is a Unicode string'. Instead of the 128 ASCII characters you can now use all Unicode characters, more than 110,000 at this moment. The u isn't saved to a file or database. It is visual feedback so you can see that you're dealing with a Unicode string. Use it like a normal string, because it is a normal string.
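
A small sketch of why forcing ASCII is risky, reusing the "Sacré bleu!" text from the example above. It only assumes standard string methods: encoding to ASCII fails on the accented character, while UTF-8 round-trips it unchanged. The variable names are illustrative only.

```python
# The accented character is written as an escape so the source stays ASCII.
text = u"Sacr\u00e9 bleu!"

# ASCII has no code for the accented e, so encoding raises an error.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode character and decodes back losslessly.
utf8_bytes = text.encode("utf-8")
decoded = utf8_bytes.decode("utf-8")

print(ascii_ok)
print(decoded == text)
```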

Moral of this story:

☺ when you see a u'…'

BeautifulSoup is a really awesome HTML parser. Use it to its maximum potential for parsing HTML. Just modify your code as follows:

names=[texts.text for texts in soup.findAll('a',{'href':re.compile("dog")})]

This takes the text between the anchor tags, so you won't need d = (str(eachname.string.split()))+"\n".

So the final code will be:

from bs4 import BeautifulSoup
import urllib2
import re
import codecs
url="http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=[texts.text for texts in soup.findAll('a',{'href':re.compile("dog")})]
myfile = codecs.open("base/dogs.txt","wb",encoding="Utf-8")
for eachname in names:
    eachname=re.sub(r"[\t\n]","",eachname)
    myfile.write(eachname+"\n")
myfile.close()

If you just need the output in the file without the u, use codecs.open() or io.open() to open a text file with an appropriate text encoding (i.e. encoding="...") instead of opening a byte file with open().

that would be

myfile = codecs.open("base/dogs.txt","w+",encoding="Utf-8")

in your case.

and the output in the file will be

BARTSSHESWAYCOOL                            
DK'S SEND ALL                            
SHAKIN THINGS UP                            
FROSTED COOKIE                            
JD EMBELLISH                            
WW CASH N CARRY                            
FREEDOM ROCK                            
HVAC BUTCHIE 
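
The io.open() alternative mentioned above can be sketched like this. It is only a sketch: it writes to a temporary directory instead of base/dogs.txt, and it uses two names from the output above as hard-coded sample data rather than scraping the page.

```python
import io
import os
import tempfile

# Sample data copied from the output listing; a real run would scrape these.
names = [u"FROSTED COOKIE", u"JD EMBELLISH"]

# Write into a temporary directory so the sketch has no side effects.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dogs.txt")

# io.open() in text mode accepts Unicode strings and encodes them on write.
with io.open(path, "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + u"\n")

# Reading with the same encoding gives the Unicode strings back.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back)
```

The file holds only the text itself. No u prefix or quotes are written, because those belong to the repr and not to the string.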

Also see this question, which I asked about almost the same problem.

Licensed under: CC-BY-SA with attribution