Question

I have a Series object to which I need to apply a function that uses bigrams to correct a word when it occurs next to another one. I created a list of bigrams, sorted it by frequency (highest first), and called it fdist.

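import nltk

bigrams = [b for l in text2 for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
freq = nltk.FreqDist(bigrams)  # computes frequency of occurrence
fdist = [bigram for bigram, _ in freq.most_common()]  # sorted by frequency, highest first (freq.keys() is only sorted in NLTK 2)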

Next, I created a function that accepts each line (a sentence, i.e. one element of the list) and uses the bigrams to decide whether to correct it further.

def bigram_corr(line): #function with input line(sentence)
    words = line.split() #split line into words
    for word1, word2 in zip(words[:-1], words[1:]): #generate 2 words at a time words 1,2 followed by 2,3 3,4 and so on
        for i,j in fdist: #iterate over bigrams
            if (word2==j) and (jf.levenshtein_distance(word1,i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 2 or 1, replace word with highest occurring bigram
               word1=i #replace
               return word1 #return word

The problem is that only a single word is returned for an entire sentence, e.g. "Lts go twards the east is" is replaced by "lets". It looks like the further iterations aren't working.
The for loop over word1, word2 works this way: "Lts go" in the 1st iteration, where "Lts" will eventually be replaced by "lets", since "lets" occurs more frequently with "go".

"go towards" in the 2nd iteration.

"towards the" in the 3rd iteration, and so on.

There is a minor error which I can't figure out; please help.


Solution

It sounds like you're doing word1 = i with the expectation that this will modify the contents of words. It won't: the assignment only rebinds the local name word1. If you want to modify words, you have to assign into the list directly. Use enumerate to keep track of word1's index.

As 2rs2ts pointed out, you're returning early. If you want the inner loop to terminate once you find the first good replacement, break instead of returning. Then return at the end of the function.

def bigram_corr(line): #function with input line(sentence)
    words = line.split() #split line into words
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for i,j in fdist: #iterate over bigrams
            if (word2==j) and (jf.levenshtein_distance(word1,i) < 3): #if 2nd words of both match, and 1st word is at an edit distance of 2 or 1, replace word with highest occurring bigram
                words[idx] = i
                break
    return " ".join(words)
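To see why rebinding the loop variable has no effect on the list while assigning through the index does, here is a minimal sketch. The word list is made up for illustration and lower() stands in for the real correction:

```python
words = ["Lts", "go", "twards"]

# Rebinding the loop variable only changes the local name, not the list.
for word in words:
    word = word.lower()
after_rebinding = list(words)

# Assigning through the index changes the list itself.
for idx, word in enumerate(words):
    words[idx] = word.lower()
after_index_assignment = list(words)

print(after_rebinding)         # ['Lts', 'go', 'twards']
print(after_index_assignment)  # ['lts', 'go', 'twards']
```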

OTHER TIPS

The return statement halts the function entirely. I think what you want is:

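def bigram_corr(line):
    words = line.split()
    words_to_return = []
    for word1, word2 in zip(words[:-1], words[1:]):
        for i, j in fdist:
            if (word2 == j) and (jf.levenshtein_distance(word1, i) < 3):
                words_to_return.append(i)  # first (most frequent) match wins
                break
        else:
            words_to_return.append(word1)  # no match: keep the original word
    words_to_return.extend(words[-1:])  # the last word is never a word1, so keep it as is
    return ' '.join(words_to_return)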

This puts each of the words which you have processed into a list, then rejoins them with spaces and returns that entire string, since you said something about returning "the entire sentence."

I am not sure if the semantics of your code are correct, since I don't have the jf library or whatever it is that you're using and therefore I can't test this code, so this may or may not solve your problem entirely. But this will help.
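If you want to try the idea without the jf library, something like the following might work as a rough test harness. Both the pure-Python levenshtein_distance and the tiny fdist list are stand-ins I made up for illustration; they are not the asker's data or whatever jf actually provides, and the function names first/second are just renamed i/j.

```python
def levenshtein_distance(a, b):
    """Plain dynamic-programming edit distance (stand-in for jf.levenshtein_distance)."""
    previous = list(range(len(b) + 1))
    for row, char_a in enumerate(a, start=1):
        current = [row]
        for col, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[col] + 1,          # deletion
                               current[col - 1] + 1,       # insertion
                               previous[col - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical bigram list, assumed to be sorted by descending frequency.
fdist = [("lets", "go"), ("towards", "the"), ("the", "east")]


def bigram_corr(line):
    words = line.split()
    for idx, (word1, word2) in enumerate(zip(words[:-1], words[1:])):
        for first, second in fdist:
            if word2 == second and levenshtein_distance(word1, first) < 3:
                words[idx] = first  # write the correction back into the list
                break               # stop at the most frequent matching bigram
    return " ".join(words)


corrected = bigram_corr("Lts go twards the east")
print(corrected)  # lets go towards the east
```

Note that the comparison is case-sensitive, so "Lts" is at distance 2 from "lets" (one substitution plus one insertion); lowercasing the input first may be worth considering.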

Licensed under: CC-BY-SA with attribution