Question

If I have 1000 tokens (I assume tokens are the features left after preprocessing the dataset), how many bigram features would be generated from those 1000 tokens (words)? Does each token form a bigram with every other token in the vocabulary?

I am asking because I have to fill in the number of words to keep in the vocabulary in Weka.


Solution

You cannot compute this from the number of tokens alone. Bigrams are pairs of tokens that occur side by side (the term comes from n-gram models, which have a notion of sequence). So to count the bigrams, you have to slide a 2-token window through your data and check how many distinct pairs you find.
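As a rough illustration of that sliding-window count, here is a minimal Python sketch. The function name `distinct_bigrams` and the sample sentence are made up for this example, and it assumes the text is already tokenized by whitespace; Weka's own tokenizer may split the text differently.

```python
from typing import List, Set, Tuple


def distinct_bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    """Slide a 2-token window over the sequence and collect the distinct pairs."""
    # zip pairs each token with its right-hand neighbour; the set drops repeats.
    return set(zip(tokens, tokens[1:]))


# Hypothetical sample data, tokenized by whitespace.
tokens = "the cat sat on the mat and the cat slept".split()
vocabulary = set(tokens)
bigrams = distinct_bigrams(tokens)

print(len(vocabulary), "distinct tokens")
print(len(bigrams), "distinct bigrams")
```

The pair ("the", "cat") occurs twice in the sample but is counted once, which is why the result depends on the actual text and not only on the vocabulary size.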

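If your data X contains N distinct tokens, all you can say is that the number of distinct bigrams B is bounded. For a single contiguous token sequence, N - 1 <= B <= N^2, and B can never exceed the number of adjacent token pairs actually present in the data. With 1000 tokens that is anywhere from 999 to 1,000,000 bigrams, so the exact number requires the procedure outlined above.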
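To see how wide that range is in practice, the sketch below builds two sequences over the same vocabulary and counts their distinct bigrams. The 4-token vocabulary and both sequences are invented purely for illustration.

```python
from itertools import product


def count_distinct_bigrams(tokens):
    """Number of distinct adjacent token pairs in the sequence."""
    return len(set(zip(tokens, tokens[1:])))


vocab = ["a", "b", "c", "d"]  # hypothetical vocabulary, N = 4
n = len(vocab)

# Repetitive text: the same cycle over and over yields few distinct pairs.
repetitive = vocab * 3

# Varied text: writing out every ordered pair back to back covers all N^2 pairs.
varied = [tok for pair in product(vocab, repeat=2) for tok in pair]

few = count_distinct_bigrams(repetitive)
many = count_distinct_bigrams(varied)
print(n, few, many)
```

Both sequences use exactly the same 4 tokens, yet the first produces 4 distinct bigrams and the second 16, which is the N^2 maximum.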

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow