Question

Using NLTK's WhitespaceTokenizer I am able to build a vocabulary that contains non-alphanumeric terms, but at the transform step those terms are not counted: they are 0 in all the feature vectors. So even though I tokenized the documents by simple whitespace splitting, it seems I also need to change the token_pattern of the CountVectorizer. However, I cannot figure out what kind of regular expression pattern I should use. Any idea?

Solution

From your question, it seems that what you need is to learn regular expressions.

Note that token_pattern is only consulted when you do not pass a custom tokenizer; if you keep your NLTK tokenizer, the pattern is ignored, so the alternative is to drop the tokenizer and widen the pattern instead.

If you want the pattern to keep every whitespace-separated token, you can set the token_pattern parameter of CountVectorizer to:

\S+

Meaning it will match any run of non-whitespace characters, which reproduces whitespace tokenization.

If you instead want to match only tokens made entirely of non-alphanumeric characters, you can use:

[^A-Za-z0-9\s]+

(the + avoids zero-length matches, and \s keeps whitespace out of the token itself).
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow