Python - Processing Unicode (Russian) Txt file

https://stackoverflow.com/questions/16703270

30-05-2022

Question

I have been puzzling for the last few hours over a Russian tab-delimited txt file. Here is what it looks like:

CODE    AD_GROUP    KEYWORD MATCH_TYPE

009966  Автостраховка   автостраховка   Broad
009965  Автостраховка   страховкаавто   Broad
009964  Автостраховка   страховка автомобиля    Broad

The goal is to parse the txt file and, for now, print each keyword separately.

So far I have:

f = open("struct.txt",encoding="UTF-8",errors='strict')

for line in f:
    vals = line.split("\t")
    print(vals[2])

f.close()

But I keep getting the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Also, if I call len(vals) to see the length of each list, I get the following: 4 1 4 1, which probably means the split("\t") is not working?

I am using Python 3.3 on a Mac.

Lastly, I don't think the problem is the Mac command line failing to display Cyrillic characters. It has displayed them before without problems (western Windows versions seem to fail at that).

Please let me know what I am doing wrong.

Thank you!

Solution

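Using this code (Python 2.7)

# Read the tab-delimited file and print the third column (KEYWORD)
with open("struct.txt") as f:
    for line in f:
        # Split on tabs and drop empty fields (e.g. from doubled tabs)
        vals = [item for item in line.strip().split("\t") if item != '']
        # Skip blank or short lines so vals[2] cannot raise IndexError
        if len(vals) > 2:
            print(vals[2])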

and your source (which I edited to make sure it had tabs), I get the following output:

> python so_16703270.py
KEYWORD
автостраховка
страховкаавто
страховка автомобиля

Are you sure you have tabs throughout the file and not spaces in some places?
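
One more thing that may be worth checking: the byte 0xff at position 0 in the reported error is what a UTF-16 little-endian byte-order mark (FF FE) starts with, so the file may not be UTF-8 at all. Also note that a blank line splits into a single-element list, which would give a length of 1. Below is a minimal Python 3 sketch under those assumptions. The helper names guess_encoding and read_keywords are made up for illustration, and the BOM check is only a heuristic (it does not handle UTF-32 or files without a BOM).

```python
import codecs


def guess_encoding(raw, default="utf-8"):
    """Guess a text encoding from a leading byte-order mark (BOM), if any.

    Heuristic only: UTF-32 and BOM-less files are not detected.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    return default


def read_keywords(path, column=2):
    """Return the given tab-separated column from every row that has it."""
    # Peek at the first bytes in binary mode to look for a BOM
    with open(path, "rb") as raw_file:
        encoding = guess_encoding(raw_file.read(4))

    keywords = []
    with open(path, encoding=encoding) as f:
        for line in f:
            vals = line.rstrip("\r\n").split("\t")
            # Blank or short lines do not have the column, so skip them
            if len(vals) > column:
                keywords.append(vals[column])
    return keywords
```

Calling read_keywords("struct.txt") and printing each item should then give the KEYWORD column, provided the file really is UTF-16 or UTF-8 and the fields are separated by tabs. If the lengths still look wrong, printing repr(line) for a few lines shows whether the separators are tabs (\t) or spaces.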

Licensed under: CC-BY-SA with attribution
