German Umlauts read in with raw_input() in Python 2.7

https://stackoverflow.com/questions/22772888

25-06-2023

Question

I am a programming beginner writing a simple console flashcard program for IPython to learn German words. The answer key is an Excel file, which I read in, organize, and save as Unicode strings. The problem occurs when the user has to type a German word into the console.

I have this at the top:

# -*- coding: utf-8 -*-

and then later I read in the German word Kaufhäuser, typed into the console:

var = raw_input().decode('utf-8')

Then as soon as I enter it in the console, I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 5: invalid start byte

Other solutions on Stack Overflow dealing with umlauts point either to the coding line at the top of the file or to decoding the string to turn it into Unicode. But in each of those cases the string is written directly in the source code rather than read in with raw_input(), and when I try the same fixes I still get the error message.

Solution

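You appear to be running the code in a Windows console. The console doesn't use UTF-8; it uses a code page, probably code page 437, in which byte 0x84 is ä. The coding line at the top of the file only declares the encoding of the source file itself and has no effect on bytes arriving from the console. If you decode the input with 'cp437' you should get the proper Unicode, or better yet, use sys.stdin.encoding so that you always get the console's actual encoding.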

import sys
var = raw_input().decode(sys.stdin.encoding)
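
To see why the original decode fails, here is a minimal sketch that assumes the console is on code page 437 and hard-codes the bytes it would then deliver for Kaufhäuser. The byte string is an illustration, not captured console output, and the variable names are made up for this example.

```python
# Bytes a cp437 console would send for "Kaufhäuser" (assumed): 0x84 is 'ä' in cp437.
raw = b'Kaufh\x84user'

# Decoding with the console's code page recovers the intended text.
decoded = raw.decode('cp437')

# Decoding the same bytes as UTF-8 fails: 0x84 is a continuation byte,
# so it cannot start a UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_failed = False
    failing_position = None
except UnicodeDecodeError as exc:
    utf8_failed = True
    failing_position = exc.start  # index of the offending byte

print(failing_position)
```

The reported position matches the "position 5" in the error message from the question.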

Edit: a little experimentation shows that sys.stdin.encoding is None when the input is redirected. A more robust solution:

import sys

# use the console encoding, falling back to UTF-8 when it is unknown (None) or plain ASCII
encoding = 'utf-8' if sys.stdin.encoding in (None, 'ascii') else sys.stdin.encoding
var = raw_input().decode(encoding)
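
The same fallback can be wrapped in small functions so it can be checked without a console. This is only a sketch: pick_encoding and decode_console_bytes are hypothetical helper names, not part of Python or of the code above, and the choice of UTF-8 as the fallback is the same assumption the snippet above makes.

```python
import sys


def pick_encoding(stream_encoding, fallback='utf-8'):
    """Hypothetical helper: use the stream's encoding unless it is None or 'ascii'."""
    if stream_encoding in (None, 'ascii'):
        return fallback
    return stream_encoding


def decode_console_bytes(raw, stream_encoding):
    """Hypothetical helper: decode raw input bytes with the chosen encoding."""
    return raw.decode(pick_encoding(stream_encoding))


# Encoding that would be used for the current stdin (None when stdin has no encoding).
console_encoding = pick_encoding(getattr(sys.stdin, 'encoding', None))

# In Python 2.7 the helper might be called as:
#     var = decode_console_bytes(raw_input(), sys.stdin.encoding)
```

Passing the encoding in as an argument, rather than reading sys.stdin.encoding inside the function, makes the redirected-input case (None) easy to exercise directly.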

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow