Question

By using NLTK's WhitespaceTokenizer I am able to curate a vocabulary with non-alphanumeric terms, but at the transform step those terms are not counted: they are 0 in all the feature vectors. So even though I tokenized the docs with a simple whitespace split, I also need to change the token_pattern of the CountVectorizer. However, I am unable to figure out which regular expression pattern I should use. Any idea?

Was it helpful?

Solution

From your confusion, it seems that what you need is to learn RegEx (here).

If you want the pattern to keep every whitespace-delimited token, you can set the token_pattern parameter of CountVectorizer to:

\S+

CountVectorizer applies this pattern with re.findall, so \S+ reproduces a plain whitespace split (each token is a run of one or more non-whitespace characters). Note that a pattern like .* would instead match the whole document as a single token, and that token_pattern is ignored when you also pass a custom tokenizer.
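As a sketch of that fix (the documents and emoticon tokens below are made up for illustration, assuming scikit-learn's CountVectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents containing non-alphanumeric tokens (emoticons).
docs = ["good :) movie", "bad :( movie :("]

# \S+ keeps every whitespace-delimited chunk, so ":)" and ":(" survive,
# unlike the default pattern, which only keeps words of 2+ word characters.
vec = CountVectorizer(token_pattern=r"\S+", lowercase=False)
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # vocabulary now includes ':)' and ':('
print(X.toarray())              # per-document counts, emoticons included
```

With the default token_pattern, the same call would silently drop ":)" and ":(" and their columns would stay at 0, which is exactly the symptom described in the question.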

If you just want to match tokens made up entirely of non-alphanumeric characters, you can use:

[^A-Za-z0-9]+

(Use + rather than *: with * the pattern also produces empty matches.)
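To see how these two patterns behave, you can imitate what CountVectorizer does internally with the standard-library re.findall (the example string is made up):

```python
import re

text = "great movie :) 10/10"

# \S+ keeps every whitespace-delimited token, punctuation included.
print(re.findall(r"\S+", text))
# → ['great', 'movie', ':)', '10/10']

# [^A-Za-z0-9]+ extracts only runs of non-alphanumeric characters;
# note that whitespace counts as non-alphanumeric too.
print(re.findall(r"[^A-Za-z0-9]+", text))
# → [' ', ' :) ', '/']
```

Because whitespace matches [^A-Za-z0-9], you may want to exclude it explicitly (e.g. [^A-Za-z0-9\s]+) if you only care about punctuation-style tokens.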
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow