Question

I need to check the spelling of Russian words from a Python script. I am piping those words to hunspell via the shell. My hunspell dictionaries are all UTF-8, and I have no problems using them from the command line.

But something funky is happening when I try to send the strings from my Python script.

If I use the German dictionary:

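import subprocess

text = "Universitüt"
# Build a shell pipeline that echoes the word into hunspell
cmd = "echo " + text + " | /usr/local/bin/hunspell -d German_de_DE"
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True, executable="/bin/bash")
result, err = p.communicate()
if result:
    # result is bytes; split() yields a list of whitespace-separated byte tokens
    result = result.split()
    print(result)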

I get the response that I am expecting:

[b'Hunspell', b'1.3.2', b'&', b'Universit', b'4', b'0:', b'Universit\xc3\xa4r,', b'Universit\xc3\xa4t,', b'Universen,', b'Universaler', b'*']

and I can deal with that. But if I send a Russian word to the Russian dictionary with the same code, changing only these two lines:

text = "университат"
cmd = "echo " + text + " | /usr/local/bin/hunspell -d Russian_ru_RU"

The response from hunspell is empty:

[b'Hunspell', b'1.3.2']

Directly from bash it works:

echo университат | hunspell -d Russian_ru_RU
Hunspell 1.3.2
& университат 1 0: университет

So I suppose it's some kind of encoding issue. But I am at a loss as to what it could be, considering that my locale is UTF-8 and Python's sys.getdefaultencoding() also reports utf-8.

I am using Python 3.3.2 on Mac OS X.

Any tips would be greatly appreciated.


Solution

As Iwan Aucamp suggested in the comments, the solution is to use:

hunspell -i UTF-8 ...

i.e. make sure that hunspell knows it's getting UTF-8 strings.

Once I added that flag to the command in my script, piping strings to hunspell from Python gave the same results as running hunspell in the shell (where the -i flag was not needed).
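For reference, here is a minimal sketch of how the fix might look in the script. It passes `-i UTF-8` explicitly and, as a design choice that goes beyond the original code, sends the word on stdin instead of going through `echo` and a shell. The helper names `build_hunspell_cmd`, `parse_hunspell_output` and `check_words` are hypothetical, and the parser only handles the `&` (suggestions) and `#` (no suggestions) lines of the output format shown in the question.

```python
import subprocess


def build_hunspell_cmd(dictionary, binary="hunspell"):
    """Return the argument list for hunspell with the input encoding stated explicitly."""
    # -i UTF-8 tells hunspell how to decode stdin; -d selects the dictionary.
    return [binary, "-i", "UTF-8", "-d", dictionary]


def parse_hunspell_output(raw):
    """Turn hunspell's byte output into a list of (word, suggestions) for misspelled words.

    Assumes output lines shaped like "& word count offset: sug1, sug2"
    or "# word offset", as in the question.
    """
    results = []
    for line in raw.decode("utf-8").splitlines():
        if line.startswith("& "):
            head, _, tail = line.partition(": ")
            results.append((head.split()[1], tail.split(", ")))
        elif line.startswith("# "):
            results.append((line.split()[1], []))
    return results


def check_words(text, dictionary, binary="hunspell"):
    """Pipe text to hunspell as UTF-8 bytes and return the parsed result."""
    proc = subprocess.Popen(
        build_hunspell_cmd(dictionary, binary),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Encode explicitly so the bytes match what -i UTF-8 promises.
    out, _err = proc.communicate((text + "\n").encode("utf-8"))
    return parse_hunspell_output(out)


# Example call (requires hunspell and the dictionary from the question):
# check_words("университат", "Russian_ru_RU", binary="/usr/local/bin/hunspell")
```

Skipping the shell also avoids quoting problems if a word ever contains characters that bash treats specially.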

Licensed under: CC-BY-SA with attribution