Question

Using NLTK's WhitespaceTokenizer I can build a vocabulary that contains non-alphanumeric terms, but at the transform step those terms are never counted: they are 0 in every feature vector. So even though I tokenized the documents with a simple whitespace split, I also need to change the token_pattern of the CountVectorizer. However, I cannot figure out which regular expression pattern I should use. Any idea?

The solution

From your confusion, it sounds like what you need is to learn regular expressions (here).

If you want the token pattern to match everything, you can set the token_pattern parameter of CountVectorizer to:

.*

This way, every token produced by the whitespace split is kept.

If you just want to match non-alphanumeric tokens, you can use:

[^A-Za-z0-9]+
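Note that the `+` (one or more) quantifier matters here: with `*` (zero or more) the pattern also matches the empty string at every position. A quick check with plain `re` (the sample strings are made up) shows the difference:

```python
import re

# CountVectorizer applies token_pattern via re.findall under the hood.
# With '*' the pattern matches the empty string at every position;
# with '+' only real runs of non-alphanumeric characters match.
print(re.findall(r"[^A-Za-z0-9]*", "ab"))        # only empty matches
print(re.findall(r"[^A-Za-z0-9]+", "ab :-) cd")) # one non-alphanumeric run
```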
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow