Question

I am executing a subprocess using Popen and feeding it input as follows (using Python 2.7.4):

env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
args = ['chasen', '-i u', '-F"%m "']
process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
out, err = process.communicate(input=string)

Setting LC_ALL in the subprocess environment is necessary because the input string includes Japanese characters, and when the script is not executed from the command line (in my case it is called by Apache), Python cannot guess the encoding.

This setup has worked fine for me with other commands. However, now that I'm using chasen (a Japanese tokenizer), whenever I send it unicode characters the subprocess does not return; it just sits there while the Python script chews up memory. This seems like an encoding problem, but I thought I would have sorted that out by specifying the encoding with the LC_ALL environment variable.

Edit: Extra weirdness: I don't get this problem when executing the Python script from the command line, with the notable exception of the '。' character. For some reason, that character causes the same strange behaviour from chasen there too.

Solution

This is a bug in chasen. When chasen is run from Python, a trace of its system calls shows the same two calls repeating:

write(1, "\n", 1)                       = 1
read(0, "", 4096)                       = 0
write(1, "\n", 1)                       = 1
read(0, "", 4096)                       = 0

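In other words, chasen does not handle EOF correctly: read() returns 0 bytes (end of input), but chasen writes another newline and reads again instead of exiting. communicate() keeps collecting that never-ending output, which would explain the growing memory use. To work around this, append a newline ('\n') to your Python string, like this: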

# coding: utf-8
import os
from subprocess import Popen, PIPE

string = u"悪妻は百年の不作。"

env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
args = ['chasen', '-i u', '-F"%m "']
process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
# Append a newline so the last line is terminated, then encode to UTF-8 bytes.
out, err = process.communicate(input=(string + u'\n').encode('utf-8'))

print(out)
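
If several call sites feed chasen, the newline workaround can be kept in one place. The following is only a minimal sketch: `to_chasen_input` is a hypothetical helper name (it is not part of chasen or of the subprocess module), and it assumes chasen is reading UTF-8 as in the example above.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_chasen_input(text, encoding="utf-8"):
    """Return `text` as encoded bytes that end with a newline.

    Hypothetical helper for illustration; the name and the default
    encoding are assumptions, not anything chasen itself defines.
    """
    if isinstance(text, bytes):
        # Already encoded (e.g. a Python 2 str): decode before checking.
        text = text.decode(encoding)
    if not text.endswith("\n"):
        # Terminate the last line so chasen does not loop at EOF.
        text += "\n"
    return text.encode(encoding)
```

It would then be used as `process.communicate(input=to_chasen_input(string))`. Input that already ends with a newline is passed through unchanged, so the helper is safe to apply more than once.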
Licensed under: CC-BY-SA with attribution