Question

Having read Peter Norvig's How to Write a Spelling Corrector, I tried to make the code work for Persian. I rewrote the code like this:

import re, collections

def normalizer(word):
    # map Arabic Yeh (U+064A) and Arabic Kaf (U+0643) to their Persian forms
    word = word.replace('ي', 'ی')
    word = word.replace('ك', 'ک')
    # strip the hamza-above diacritic (U+0654)
    word = word.replace('ٔ', '')
    return word

def train(features):
    # count occurrences; the default of 1 gives unseen words a small non-zero weight
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(normalizer(open("text.txt", encoding="UTF-8").read()))

alphabet = 'ا آ ب پ ت ث ج چ ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی ء'

In Norvig's original code, NWORDS is the dictionary that records each word and its number of occurrences in the training text. I tried print(NWORDS) to check whether it works with Persian characters, but the output is wrong: it doesn't count words, it counts occurrences of individual letters.
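For example, with a tiny made-up two-word string the counts come out per character rather than per word:

sample = 'سلام دنیا'   # hypothetical two-word sample
print(train(sample))   # keys are single letters such as 'س', 'ل', 'ا', 'م', plus the space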

Does anyone have any idea where the code went wrong?

P.S. 'text.txt' is actually a long concatenation of Persian texts, like its equivalent in Norvig's code.


Solution

You are applying normalizer to the entire file contents as one string, and train then iterates over that string character by character. The text is never split into words, which is why you get letter counts.

I suspect you really want to be doing something like this:

with open('text.txt', encoding='UTF-8') as fin:
    NWORDS = train(normalizer(word) for ln in fin for word in ln.split())

I would also look into using collections.Counter: http://docs.python.org/2/library/collections.html#collections.Counter
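For example, a minimal sketch of the same word-counting step built on Counter, assuming the same text.txt and the normalizer from the question:

import collections

with open('text.txt', encoding='UTF-8') as fin:
    # Counter does the tallying that train() does, without the defaultdict boilerplate
    NWORDS = collections.Counter(normalizer(word) for ln in fin for word in ln.split())

print(NWORDS.most_common(10))  # the ten most frequent words, as a quick sanity check

One difference from train: Counter reports 0 for unseen words rather than the 1 that defaultdict(lambda: 1) provides, so any downstream lookup that relies on that smoothing would need a small adjustment.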

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow