Question

In Peter Norvig's spelling corrector article, he takes a shortcut when defining the conditional probability:

So I took a shortcut: I defined a trivial model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. By "known word" I mean a word that we have seen in the language model training data -- a word in the dictionary. We can implement this strategy as follows:

def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

I don't see how this code implements his strategy. To me, the final return line just returns the word with the highest count (prior), rather than following the priority list in his model.

Also, in defining his word-count dictionary:

import collections

def train(features):
    # Every word starts at a count of 1, including words never seen before.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

Why didn't he start from 0? I mean shouldn't the default_factory be (lambda:0) or (int)?

Can anyone explain? You can find the full article here: http://norvig.com/spell-correct.html

Thanks


Solution

The priority list is implemented by the or operator. If known([word]) is a non-empty set, that set is the value of the whole expression and nothing further is evaluated. If it's empty, the right-hand side

known(edits1(word)) or known_edits2(word) or [word]

is evaluated in the same way. For example:

>>> [1, 2, 3] or [4, 5, 6]
[1, 2, 3]
>>> [] or [4, 5, 6]
[4, 5, 6]
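
To connect this back to correct(), here is a minimal sketch with a made-up NWORDS table. The pick() helper and its hand-supplied edit lists are hypothetical stand-ins for the article's edits1() and known_edits2(), used only so the example runs on its own.

```python
# Toy word counts; a made-up stand-in for the trained model in the article.
NWORDS = {"the": 100, "then": 30, "he": 50, "they": 20}

def known(words):
    return set(w for w in words if w in NWORDS)

def pick(word, edits1_words, edits2_words):
    # Hypothetical variant of correct(): the edit lists are passed in by hand.
    # `or` returns the first non-empty operand, so a later tier is used
    # only when every earlier tier is empty.
    candidates = (known([word]) or known(edits1_words)
                  or known(edits2_words) or [word])
    # max() ranks only the words inside the single tier that was selected.
    return max(candidates, key=NWORDS.get)

print(pick("then", ["the", "they"], []))     # then
print(pick("thw", ["the", "they"], ["he"]))  # the
print(pick("xyz", [], []))                   # xyz
```

In the first call "then" is returned even though "the" has a higher count, because the distance-0 tier is non-empty and the distance-1 words are never considered. The count only breaks ties within a tier.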

Why didn't he start from 0?

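That's Laplace (add-one) smoothing. Every word, including one never seen in the training data, starts with a count of 1, so an unseen word gets a small non-zero probability instead of zero. It's actually explained in the article.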
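
As a rough illustration of what the starting value changes, the sketch below adds a start parameter to train(). That parameter is an assumption made for this comparison and is not part of the article's code.

```python
import collections

def train(features, start):
    # `start` (hypothetical parameter) is the count a word receives
    # the first time it is looked up in the model.
    model = collections.defaultdict(lambda: start)
    for f in features:
        model[f] += 1
    return model

words = ["the", "the", "cat"]
smoothed = train(words, 1)  # the article's choice
raw = train(words, 0)       # plain counting

print(smoothed["the"], smoothed["cat"], smoothed["dog"])  # 3 2 1
print(raw["the"], raw["cat"], raw["dog"])                 # 2 1 0
```

With a start of 1, every count is one higher than the raw count, and the unseen word "dog" comes out as 1 rather than 0.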

Other tips

Regarding the first question, the priority order is implemented in this line:

candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]

The expression evaluates to only one of those collections, the first non-empty one, not their union. Here's a simpler example that shows how it works.

>>> n1 = [1,2,3]
>>> n2 = [2,3,4]
>>> n1 or n2
[1, 2, 3]
>>> [] or n2
[2, 3, 4]
>>> 

Not sure about the defaultdict part, but looks like larsmans already answered that.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow