Question

I have a JSON file with several keys. I want to take the value of one of the keys and write that string to a file. The string is originally unicode, so I do s.encode('utf-8').

Now, there is another key in that JSON which I write to a second file (this is a machine learning task: I write the original string to one file and the features to another). The problem is that, at the end, the file with the unicode strings turns out to have more lines (when counted with "wc -l"), and this misguides my tool, which crashes complaining that the sizes are not the same.

Code for reference:

for line in input_file:
    j = json.loads(line)
    text = j['text']
    label = j[t]

    output_file.write(str(label) + '\t' + text.encode('utf-8') + '\n')
    norm_file.write(j['normalized'].encode('utf-8') + '\n')

Here is the difference when using "wc -l":

16862965

is the number of lines I expect, but what I actually get is

16878681

which is higher. So I wrote a script to see how many output labels are actually there:

import sys

c = 0
with open(sys.argv[1]) as input_file:
    for line in input_file:
        p = line.split('\t')
        if p[0] not in ("good", "bad"):
            print p
        else:
            c += 1

print c

And, lo and behold, the counter prints 16862965, which means the extra lines carry no label at all. I print them out and get a bunch of empty newline chars ('\n'). So I guess my question is: what am I missing when dealing with unicode like this? Should I have stripped all leading and trailing whitespace (not that there is any in the strings)?


Solution

JSON strings can't contain literal (unescaped) newlines, e.g.,

import json

not_a_json_string = '"\n"'      # a Python source literal: quote, newline, quote
json.loads(not_a_json_string)   # raises ValueError

but they can contain escaped newlines:

json_string = r'"\n"'           # raw-string literal (== '"\\n"')
s = json.loads(json_string)     # s == u'\n' -- the parsed result IS a newline

i.e., the original text (json_string) has no newlines in it (it has a backslash followed by the character n -- two characters), but the parsed result does contain a newline: '\n' in s.

That is why the example:

for line in file:
    d = json.loads(line)
    print(d['key'])

may print more lines than the file contains.

It is unrelated to utf-8.
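
To see the effect end to end, here is a minimal, self-contained sketch (Python 2, as in the question; the one-line JSON document and the file name are made up for illustration):

import json

# One physical line on disk whose JSON string value contains an ESCAPED newline.
line = '{"text": "first\\nsecond"}'

parsed = json.loads(line)['text']  # parsing turns the escape into a real newline
print(repr(parsed))                # u'first\nsecond'

# Writing the parsed value back out yields TWO physical lines from ONE input
# line -- exactly the kind of "wc -l" mismatch described in the question.
with open('demo.txt', 'w') as f:
    f.write(parsed.encode('utf-8') + '\n')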

In general, there could also be an issue with non-native newlines, e.g., b'\r\r\n\n', or with Unicode newlines such as u'"\u2028"' (U+2028 LINE SEPARATOR).

OTHER TIPS

Do the same check you were doing, but on the values before you write them, to see how many get flagged, and make sure those values don't have '\n' in them; that may be skewing your count (as sketched below).
For more detail, see J.F.'s answer above.
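
For instance, a hypothetical pre-write check (it assumes the same input_file and keys as in the question):

import json

# Count, before writing anything, how many records contain an embedded
# newline that would break the one-record-per-line assumption.
suspect = 0
for line in input_file:
    j = json.loads(line)
    if u'\n' in j['text'] or u'\n' in j['normalized']:
        suspect += 1

print suspect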

Unrelated-to-your-error notes:

(a) When JSON is loads()ed, its strings are automatically unicode already:

>>> a = '{"b":1}'
>>> json.loads(a)['b']
1
>>> json.loads(a).keys()
[u'b']
>>> type(json.loads(a).keys()[0])
<type 'unicode'>

So str(label) in the file write should be either just label or unicode(label). And you shouldn't need to encode text and j['normalized'] when you write them to a file; instead, set the file's encoding to 'utf-8' when you open it.
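
For example, with io.open (available since Python 2.6; the output file name is made up, and label and text are the variables from the question):

import io

# The file object encodes on write, so unicode objects go in directly --
# no manual .encode('utf-8') calls needed.
with io.open('output.tsv', 'w', encoding='utf-8') as output_file:
    output_file.write(unicode(label) + u'\t' + text + u'\n')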

(b) Btw, use format() or join() in the write operations: if any of label, text, or j['normalized'] is None, the + operator will raise an error.
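
A quick sketch of the difference:

# format() stringifies None instead of raising the TypeError that
# u'...' + None would raise.
line = u'{0}\t{1}\n'.format(None, u'some text')
print(repr(line))  # u'None\tsome text\n'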

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow