Python unicode search not giving correct answer

https://stackoverflow.com/questions/10053756

29-05-2021

Question

I am trying to take Hindi words, stored one per line in file-1, and find them in the lines of file-2. I have to print the line numbers along with the number of words found on each line. This is the code:

import codecs

hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []

for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >=0:
            count_arr[counter] +=1

for iterator, count in enumerate(count_arr):
    if count>0:
        print iterator, ' ', count

This finds some words but ignores others. The input files are:

File-1:

पौधा  
वनस्पति

File-2:

वनस्पति, पेड़-पौधा  
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग  
पादप_समूह, पेड़-पौधे, वनस्पति_समूह  
पेड़-पौधा

This gives output:

0 1  
3 1

Clearly, it is ignoring वनस्पति and searching for पौधा only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?

Solution

That's because you don't remove the "\n" character at the end of each line, so you are searching for "some_pattern\n", not "some_pattern". Use the strip() function to chop it off, like this:

import codecs

# strip() removes the trailing "\n" (and stray spaces) from each search word
words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []

for line in hypernyms:
    count_arr.append(0)
    for word in words:
        # (word in line) is a bool, which adds as 0 or 1
        count_arr[-1] += (word in line)

for iterator, count in enumerate(count_arr):
    if count:
        print iterator, ' ', count
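The code in this thread is Python 2. For readers on Python 3, the same idea might be sketched as below. The function name count_matches and the in-memory lists are assumptions made for illustration (they stand in for the two files), and skipping blank words is an addition that is not part of the answer above.

```python
def count_matches(raw_words, lines):
    """For each line, count how many of the words occur in it as substrings."""
    # strip() drops the trailing "\n" and stray spaces left over from the file.
    # Skipping empty strings is an addition: "" is a substring of every line.
    words = [w.strip() for w in raw_words if w.strip()]
    return [sum(word in line for word in words) for line in lines]


# Sample data from the question, as the lines would look when read from disk.
raw_words = ["पौधा  \n", "वनस्पति\n"]
lines = [
    "वनस्पति, पेड़-पौधा  \n",
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग  \n",
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह  \n",
    "पेड़-पौधा\n",
]

# Print the index of every line that contains at least one word.
for number, count in enumerate(count_matches(raw_words, lines)):
    if count:
        print(number, count)
```

Note that this is still plain substring matching, so a word is also counted when it appears inside a longer word.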

Other suggestions

I think the problem is here:

words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()

.readlines() will leave the line break at the end, so you're not searching for पौधा, you're searching for पौधा\n, and you'll only match at the end of a line. If I use .read().split() instead, I get

0   2
2   1
3   1
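To make the difference concrete, here is a small Python 3 sketch using io.StringIO as a hypothetical stand-in for the word file, with ASCII words for brevity. It only illustrates the behaviour described above and is not the code that produced the output shown.

```python
import io

# Hypothetical in-memory stand-in for the word file (one word per line).
word_file = io.StringIO("tree\nplant\n")
line = "plant, tree\n"

raw_words = word_file.readlines()        # ['tree\n', 'plant\n']
word_file.seek(0)
clean_words = word_file.read().split()   # ['tree', 'plant']

# With the newline attached, only a word at the very end of the line matches.
raw_hits = [w in line for w in raw_words]      # [True, False]
clean_hits = [w in line for w in clean_words]  # [True, True]

print(raw_hits, clean_hits)
```

One caveat: split() with no arguments splits on any whitespace, so an entry that itself contains a space would be broken into separate words. If that matters, stripping each line keeps such entries intact.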

Add the following code and you will see why this happens: it is because of the spaces. In file 1 the first word is पौधा[space]....

for i in hypernyms:
    print "file1",i

for i in words:
    print "file2",i

Place it after count_arr = [] and before for counter, line...
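A plain print does not show trailing spaces or newlines visibly. As a possible variation on the debugging loops above, printing repr() of each line makes them explicit. This is a Python 3 sketch, and the helper name show_hidden is hypothetical.

```python
def show_hidden(lines):
    """Return repr() of each line so trailing spaces and newlines are visible."""
    return [repr(line) for line in lines]


# ASCII sample standing in for lines read from one of the files.
for shown in show_hidden(["ab  \n", "cd\n"]):
    print(shown)
# prints:
# 'ab  \n'
# 'cd\n'
```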

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow