Question

I have a JSON file with several keys. I want to take the value of one of the keys and write that string to a file. The string is originally unicode, so I do s.encode('utf-8').

Now, there is another key in that JSON which I write to a second file (this is a machine learning task: I write the original string to one file and the features to another). The problem is that, at the end, the file with the unicode strings turns out to have more lines (when counted with "wc -l"), and this misguides my tool, which crashes complaining that the sizes are not the same.

Code for reference:

for line in input_file:
    j = json.loads(line)
    text = j['text']
    label = j[t]

    output_file.write(str(label) + '\t' + text.encode('utf-8') + '\n')
    norm_file.write(j['normalized'].encode('utf-8') + '\n')

Here is the difference when using "wc -l":

16862965

is the number of lines I expect, but what I actually get is

16878681

which is higher. So I wrote a script to see how many output labels are actually there:

import sys

c = 0
with open(sys.argv[1]) as input_file:
    for line in input_file:
        p = line.split('\t')
        if p[0] not in ("good", "bad"):
            print p
        else:
            c += 1

print c

And, lo and behold, the counter prints 16862965, which means the extra lines carry no label at all. I print them out and get a bunch of empty newline chars ('\n'). So I guess my question is: what am I missing when dealing with unicode like this? Should I have stripped all leading and trailing whitespace (not that there is any in the strings)?


Solution

JSON strings can't contain literal (unescaped) newlines, e.g.,

import json

not_a_json_string = '"\n"'      # a Python source literal: quote, newline, quote
json.loads(not_a_json_string)   # raises ValueError

but they can contain escaped newlines:

json_string = r'"\n"'           # raw-string literal (== '"\\n"')
s = json.loads(json_string)     # s == u'\n' -- the parsed result IS a newline

i.e., the original text (json_string) has no newlines in it (it has a backslash followed by the character n -- two characters), but the parsed result does contain a newline: '\n' in s.

That is why the example:

for line in file:
    d = json.loads(line)
    print(d['key'])

may print more lines than the file contains.

It is unrelated to utf-8.
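
To see the effect end to end, here is a minimal, self-contained sketch (Python 2, as in the question; the one-line JSON document and the file name are made up for illustration):

import json

# One physical line on disk whose JSON string value contains an ESCAPED newline.
line = '{"text": "first\\nsecond"}'

parsed = json.loads(line)['text']  # parsing turns the escape into a real newline
print(repr(parsed))                # u'first\nsecond'

# Writing the parsed value back out yields TWO physical lines from ONE input
# line -- exactly the kind of "wc -l" mismatch described in the question.
with open('demo.txt', 'w') as f:
    f.write(parsed.encode('utf-8') + '\n')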

In general, there could also be an issue with non-native newlines, e.g., b'\r\r\n\n', or with Unicode newlines such as u'"\u2028"' (U+2028 LINE SEPARATOR).

OTHER TIPS

Do the same check you were doing, but on the values before you write them, to see how many get flagged, and make sure those values don't have '\n' in them; that may be skewing your count (as sketched below).
For more detail, see J.F.'s answer above.
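
For instance, a hypothetical pre-write check (it assumes the same input_file and keys as in the question):

import json

# Count, before writing anything, how many records contain an embedded
# newline that would break the one-record-per-line assumption.
suspect = 0
for line in input_file:
    j = json.loads(line)
    if u'\n' in j['text'] or u'\n' in j['normalized']:
        suspect += 1

print suspect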

Unrelated-to-your-error notes:

(a) When JSON is loads()ed, its strings are automatically unicode already:

>>> a = '{"b":1}'
>>> json.loads(a)['b']
1
>>> json.loads(a).keys()
[u'b']
>>> type(json.loads(a).keys()[0])
<type 'unicode'>

So str(label) in the file write should be either just label or unicode(label). And you shouldn't need to encode text and j['normalized'] when you write them to a file; instead, set the file's encoding to 'utf-8' when you open it.
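
For example, with io.open (available since Python 2.6; the output file name is made up, and label and text are the variables from the question):

import io

# The file object encodes on write, so unicode objects go in directly --
# no manual .encode('utf-8') calls needed.
with io.open('output.tsv', 'w', encoding='utf-8') as output_file:
    output_file.write(unicode(label) + u'\t' + text + u'\n')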

(b) Btw, use format() or join() in the write operations: if any of label, text, or j['normalized'] is None, the + operator will raise an error.
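
A quick sketch of the difference:

# format() stringifies None instead of raising the TypeError that
# u'...' + None would raise.
line = u'{0}\t{1}\n'.format(None, u'some text')
print(repr(line))  # u'None\tsome text\n'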

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow