How can we estimate the total number of features?

https://stackoverflow.com/questions/21557249

06-10-2022

Question

If I have 1000 tokens (I assume tokens are the features left after preprocessing the dataset), how many bigram features would be generated from those 1000 tokens (words)? Does each token form a bigram with every other token in the vocabulary?

I am asking because I have to pre-fill the number of words to keep in the vocabulary in Weka.

Solution

You cannot precompute this from the number of tokens alone. Bigrams are pairs of tokens that occur side by side in the text (the term comes from n-gram models, which rely on a notion of sequence). So to compute the number of bigrams, you have to slide a 2-token window through your data and count how many distinct pairs you find.
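The sliding-window count can be sketched in a few lines of Python. This is only an illustration: the function name `count_distinct_bigrams` and the toy sentence are assumptions, not part of the question or of Weka.

```python
def count_distinct_bigrams(tokens):
    """Count the distinct adjacent token pairs in one token sequence."""
    # zip(tokens, tokens[1:]) slides a 2-token window over the sequence;
    # the set keeps each pair only once.
    return len(set(zip(tokens, tokens[1:])))


# Hypothetical toy data standing in for a preprocessed document.
tokens = "the cat sat on the mat and the cat slept".split()

vocabulary_size = len(set(tokens))             # distinct tokens (unigram features)
bigram_count = count_distinct_bigrams(tokens)  # distinct bigram features

print(vocabulary_size, bigram_count)
```

For a corpus of several documents, the window should be run per document (or per sentence) and the resulting sets merged, so that pairs spanning a document boundary are not counted.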

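If your data X contains N distinct tokens, you can only bound the number of distinct bigrams B. For one continuous token sequence, N - 1 <= B <= N^2: every distinct token except the very first one must follow some other token, and there are at most N^2 ordered pairs. With 1000 tokens, that means at most 1,000,000 possible bigrams. The exact number requires the procedure outlined above.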
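To see how far apart the count can land for the same vocabulary size, here is a small hedged sketch. The three-token vocabulary and both sequences are made up for illustration.

```python
from itertools import product


def count_distinct_bigrams(tokens):
    """Count the distinct adjacent token pairs in one token sequence."""
    return len(set(zip(tokens, tokens[1:])))


vocab = ["a", "b", "c"]  # N = 3 distinct tokens

# Tokens rarely repeat, so the bigram count stays close to N.
sparse = ["a", "b", "c", "a"]

# Concatenating every ordered pair of tokens forces all N^2 pairs to occur.
dense = [tok for pair in product(vocab, repeat=2) for tok in pair]

sparse_count = count_distinct_bigrams(sparse)
dense_count = count_distinct_bigrams(dense)

print(sparse_count, dense_count, len(vocab) ** 2)
```

Real text usually falls much closer to the low end, because most word pairs never occur next to each other.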

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow