Question

Using NLTK's WhitespaceTokenizer I can build a vocabulary that contains non-alphanumeric terms, but at the transform step those terms are never counted: they are 0 in every feature vector. So even though I tokenized the documents with a simple whitespace split, I also need to change the token_pattern of the CountVectorizer. However, I cannot figure out which regular expression pattern I should use. Any idea?

The solution

From your confusion, it sounds like what you need is to learn regular expressions (here).

If you want the token pattern to match everything, you can set the token_pattern parameter of CountVectorizer to:

.*

This way, every token produced by the whitespace split is kept.

If you just want to match non-alphanumeric tokens, you can use:

[^A-Za-z0-9]+
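Note that the `+` (one or more) quantifier matters here: with `*` (zero or more) the pattern also matches the empty string at every position. A quick check with plain `re` (the sample strings are made up) shows the difference:

```python
import re

# CountVectorizer applies token_pattern via re.findall under the hood.
# With '*' the pattern matches the empty string at every position;
# with '+' only real runs of non-alphanumeric characters match.
print(re.findall(r"[^A-Za-z0-9]*", "ab"))        # only empty matches
print(re.findall(r"[^A-Za-z0-9]+", "ab :-) cd")) # one non-alphanumeric run
```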
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow