C# algorithm for N-gram
26-09-2019
Question
I intend to use the n-gram code from this article. For the text "the quick red", the algorithm produces these trigram results:
t, th, the, he, e, q, qu, qui, uic, ick, ck, k, r, re, red, ed, d
However, Wikipedia reckons it should be:
the qui k_r
he_ uic _re
e_q ick red
_qu ck_
(space indicated by ‘_’).
Which is correct? Are there any other C# implementations out there?
Solution
The second example is correct.
P.S. Why do you generate trigrams for the complete text rather than only for individual words? What is your use case?
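For reference, here is a minimal C# sketch of the sliding-window reading that yields the Wikipedia-style list. It treats the whole text as one string, advances one character per step, and shows spaces as '_'. The names CharNGrams and Generate are hypothetical and are not taken from the linked article.

```csharp
using System;
using System.Collections.Generic;

public static class CharNGrams
{
    // Slides a window of length n over the whole text, one character per step.
    public static List<string> Generate(string text, int n)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));

        var grams = new List<string>();
        for (int i = 0; i + n <= text.Length; i++)
        {
            // Show spaces as '_' to match the notation used in the question.
            grams.Add(text.Substring(i, n).Replace(' ', '_'));
        }
        return grams;
    }
}
```

Calling CharNGrams.Generate("the quick red", 3) gives: the, he_, e_q, _qu, qui, uic, ick, ck_, k_r, _re, red.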
OTHER TIPS
The first is correct. I used character n-grams in my thesis. You must move forward one character at each step. That way, similar words can be found.
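The first list in the question looks like what you get when each word is handled separately and the window is allowed to hang over the word's edges, so shortened grams appear at the start and end of every word. Below is a minimal C# sketch under that assumption. WordNGrams and Generate are hypothetical names, and this is not the article's code.

```csharp
using System;
using System.Collections.Generic;

public static class WordNGrams
{
    // Per-word character n-grams; the window moves one character per step
    // and is clipped to the word boundaries, so edge grams are shorter than n.
    public static List<string> Generate(string text, int n)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));

        var grams = new List<string>();
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // Window start runs from -(n - 1) up to the last character of the word.
            for (int start = 1 - n; start < word.Length; start++)
            {
                int from = Math.Max(start, 0);
                int to = Math.Min(start + n, word.Length);
                grams.Add(word.Substring(from, to - from));
            }
        }
        return grams;
    }
}
```

Calling WordNGrams.Generate("the quick red", 3) gives: t, th, the, he, e, q, qu, qui, uic, ick, ck, k, r, re, red, ed, d.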
Licensed under: CC-BY-SA with attribution