Getting the mid-frequency words from a list and getting their synonyms by web scraping Indo WordNet in Python

https://stackoverflow.com/questions/22959619

30-06-2023

Question

I am quite a novice in Python, and for a project I need to use a wordlist corpus that already contains the frequency of each word. From it I have to get the mid-frequency words. This particular corpus does not contain headers for its columns. What I am trying to do is find the high-frequency words and the low-frequency ones and get rid of them. I tried this piece of code, but it failed.

list1 = open(r'C:\Python27\bengali_wordlist_full.txt', 'r').read()

list2=[]

 >>> from collections import Counter
 >>> counts = Counter(list1)
 >>> print(counts)
 Counter({'\xe0': 2258118, '\xa6': 1788720, '\xa7': 542685, '\n': 292624, '\t': 292624, '\xbe': 243763, '1': 186672, '\xb0': 182384, '\x87': 165988, '\x8d': 164978, '\xbf': 133210, '\xa8': 110359, '\x95': 90861, '\xac': 80290, '\x9f': 75845, '\xa4': 75037, '\xb8': 74818, '\xb2': 72230, '\xae': 70510, '\x81': 63316, '\xaa': 57900, '2': 52053, '\xaf': 48545, '\x9c': 40084, '\x8b': 39166, '\x97': 34319, '\x80': 33692, '\xb9': 28896, '\xb6': 28469, '3': 27054, '-': 26215, '\xad': 21478, '\x9a': 20881, '\xa1': 18973, '\x93': 18762, '4': 18452, '\xbc': 17200, '\x82': 16232, '\x86': 16053, '\xab': 15814, '\xb7': 15650, '5': 13851, '\x85': 13428, '\x8f': 12714, '\x9b': 12221, '\xa3': 11851, '\x96': 11344, '6': 11068, '\x89': 10253, '7': 9279, '\xa5': 9057, '\x83': 8021, '8': 7901, '9': 7046, '\x99': 6980, '0': 6696, '\xd9': 6541, 'e': 6415, 'a': 5440, 'i': 5051, 't': 4295, 'o': 4284, 'r': 4281, 'n': 4192, 's': 4158, '\x98': 4105, '\xd8': 3949, '\xa0': 3916, '\x9e': 3608, '\xa9': 3492, '\xe2': 3092, 'l': 2996, '.': 2947, '\x88': 2875, '\x8c': 2764, 'c': 2736, '\x8e': 2622, '\x9d': 2356, 'd': 2292, 'm': 2256, 'u': 1941, 'p': 1886, '/': 1804, 'h': 1700, 'g': 1596, 'b': 1242, 'y': 1097, 'w': 1055, 'f': 947, '\x84': 868, '\xa2': 850, "'": 713, 'v': 708, '\x90': 690, 'k': 688, '\x8a': 672, ':': 608, '\x92': 517, '\xb1': 483, 'x': 304, 'j': 295, '_': 226, 'z': 207, '\xb3': 193, '\xc3': 166, 'q': 136, '+': 135, '\xb5': 100, '\xc2': 96, '\x94': 86, '@': 81, '\xb4': 76, '\xc5': 63, '\xba': 40, '\xdb': 26, '\xce': 13, '\xcf': 11, '&': 9, '\xda': 8, '\x91': 6, '\xc6': 6, '\xbd': 3, '\xbb': 2, '\xef': 1})

This piece of code gives me frequencies in descending order, but it does not give me all of them, and I don't know how to use them the way I want. The following piece just didn't work:

>>> for word in list1:
...     if word in list2:
...         list2.index(word)[1] += 1
...     else:
...         list2.append([word,0])
...

Next I need to scrape the web to get the synonyms from Indo WordNet, and I have no idea how that can be done. This is a bilingual project and I have yet to figure out the encoding systems, hence the Unicode.

Can anyone help please?

Solution

list1 is not a list, it's a string. Try this:

list1 = open(r'C:\Python27\bengali_wordlist_full.txt').read().split()

Without .split(), Counter counts the individual characters in that file instead of separate words:

>>> Counter('string string')
Counter({'n': 2, 'i': 2, 't': 2, 'g': 2, 'r': 2, 's': 2, ' ': 1})
>>> Counter('string string'.split())
Counter({'string': 2})
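Since the corpus already stores a frequency next to each word, the mid-frequency words can probably be picked out without Counter at all. The following is only a minimal sketch under stated assumptions: each line is taken to be a word, a tab, and an integer count, in that order (the equal tab and newline counts in the question's output suggest one tab per line, but the column order is a guess). The helper names parse_wordlist and mid_frequency and the cut-off fractions are hypothetical and would need tuning for the real data.

```python
from __future__ import print_function


def parse_wordlist(lines):
    """Turn 'word<TAB>count' lines into (word, count) pairs.

    Assumption: one word and one integer count per line, separated by a
    tab, in that order. Lines that do not fit this shape are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 2:
            continue
        word, count = fields[0].strip(), fields[1].strip()
        if word and count.isdigit():
            pairs.append((word, int(count)))
    return pairs


def mid_frequency(pairs, drop_top=0.1, drop_bottom=0.1):
    """Sort by count and cut off the top and bottom fractions of the ranking."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    start = int(len(ranked) * drop_top)
    end = len(ranked) - int(len(ranked) * drop_bottom)
    return ranked[start:end]


# Tiny in-memory stand-in for the real file (made-up words and counts).
sample = [u'the\t900\n', u'of\t700\n', u'river\t40\n', u'boat\t35\n', u'quay\t1\n']
mid_words = mid_frequency(parse_wordlist(sample), drop_top=0.2, drop_bottom=0.2)
print(mid_words)
```

For the real file, the lines could be read with io.open(path, encoding='utf-8') and passed to parse_wordlist, assuming the file is UTF-8 encoded. That way each line is decoded text rather than raw bytes, which is why the Counter output in the question is full of entries such as '\xe0'.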

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow