Question

So in short my case is this:

  • Read data from RSS feed
  • Print content to the terminal

And of course the content isn't plain ASCII, it's UTF-8, so I get characters like "öäå". But when I print the text it's all mangled with escapes like '\xe4'. It has something to do with the encoding, but I just can't get my head around it. This should be trivial to do, yet my Google-fu is letting me down.

One example is when I'm going through the content word by word and trying to find the character "ö": I do:

if u"ö" in word:

Which just gives: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6...

Edit:

So I think I found my problem. I was getting the feed items then just doing str(entry.content) and passing that onwards, but that entry.content was a list holding a dictionary with unicode strings as values, so what I did (I guess) was just getting an ascii representation of the dictionary content...


Solution

You're trying to compare an encoded byte string to a unicode string. Python 2 doesn't know the bytes are UTF-8, so it falls back to its default ASCII codec when it implicitly decodes them to unicode for the comparison, and that fails on the first non-ASCII byte. The solution is to decode the text explicitly with the proper encoding.

Check out the Python Unicode HOWTO for more info.

I can reproduce your problem with this file:

# coding: utf-8

word = "öäå"
if u"ö" in word:
    print True

And fix it with this file:

# coding: utf-8

word = "öäå".decode('utf-8')
if u"ö" in word:
    print True
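The two files above are Python 2. As a rough sketch of the same idea under Python 3 (an assumption, since the question does not use it), bytes and text are separate types there and are never mixed implicitly, so the mismatch shows up as a TypeError instead of a UnicodeDecodeError. The variable names are illustrative only:

```python
# Python 3 sketch; "raw" is a stand-in for bytes that arrived from a feed or file.
raw = "öäå".encode("utf-8")

try:
    # Searching for text inside bytes is refused outright in Python 3.
    found_in_bytes = "ö" in raw
except TypeError:
    found_in_bytes = None

# Decode explicitly with the proper encoding, then compare text with text.
word = raw.decode("utf-8")
found = "ö" in word
print(found)
```

The fix is the same in both versions: decode once with the right codec, then work only with text.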

Other tips

If you know that your text is UTF-8, decode it into unicode objects before you start working with it. As soon as you read the bytes from the file, call the string's decode() method, for example word.decode('UTF8'), to get a unicode object back.
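A minimal sketch of decoding at the input boundary, written for Python 3 and assuming the input really is UTF-8. The helper read_words and the sample text are hypothetical, and io.BytesIO stands in for a file opened in binary mode:

```python
import io


def read_words(byte_stream, encoding="utf-8"):
    """Decode once, right after reading, and return a list of text words."""
    text = byte_stream.read().decode(encoding)
    return text.split()


# io.BytesIO plays the role of a file opened with mode "rb".
stream = io.BytesIO("yksi öljy kolme".encode("utf-8"))
words = read_words(stream)

# Everything after the boundary is text, so membership tests just work.
matches = [w for w in words if "ö" in w]
```

Keeping the decode in one place means the rest of the program never has to guess whether it is holding bytes or text.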

Try the feedparser package (http://packages.python.org/feedparser/). It deals with encodings well and supports almost all feed formats, so you just get well-structured data.
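Well-structured data still has to be unpacked before it is printed. The question's edit describes entry.content as a list holding a dictionary with unicode strings as values, so here is a small sketch of that pitfall using a plain list and dict as a hypothetical stand-in (no feed library is imported, and the "type" and "value" keys are assumptions for illustration). It is written for Python 3, where ascii() approximates the escaped output Python 2 would give for str() of such a list:

```python
# Hypothetical stand-in for the list-of-dicts shape described in the question.
content = [{"type": "text/plain", "value": "öäå"}]

# str() of the container gives the list's representation, not the text itself.
as_container = str(content)

# ascii() escapes non-ASCII characters, much like Python 2's output did.
escaped = ascii(content)

# Pull the text value out of the structure before working with it.
text = content[0]["value"]
has_o = "ö" in text
```

Here escaped contains sequences such as \xe4, which matches the mangled output described in the question, while text is the actual string.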

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow