Problem

By using NLTK's WhitespaceTokenizer I am able to curate a vocabulary with non-alphanumeric terms, but at the transform step those terms are not counted and they are 0 in all the feature vectors. So the problem is that even though I tokenized the docs with a simple whitespace split, I also need to change the token_pattern of the CountVectorizer. However, I am unable to figure out which kind of regular expression I should use. Any ideas?


Solution

From your confusion, it seems that what you need is to learn regular expressions (RegEx).

If you want the pattern to match everything, you can set the token_pattern parameter of CountVectorizer to:

.*

This means it will match every token produced by the tokenizer.
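
As a minimal sketch (the documents and vocabulary below are made up, assuming a vocabulary curated beforehand with something like nltk.tokenize.WhitespaceTokenizer, as in the question): CountVectorizer applies token_pattern with findall() over the raw text, so a pattern like r"\S+" (any run of non-whitespace) reproduces a plain whitespace split, whereas r".*" would return each whole document as a single token.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents and a pre-curated vocabulary that includes
# non-alphanumeric terms.
docs = ["c++ and c# are languages :)", "python is a language :)"]
vocab = ["c++", "c#", "python", ":)"]

# The default token_pattern, r"(?u)\b\w\w+\b", keeps only runs of two or
# more word characters, so tokens like "c++" or ":)" never reach the
# counting step and their columns stay 0. A broader pattern lets them through.
vectorizer = CountVectorizer(vocabulary=vocab, token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['c++' 'c#' 'python' ':)']
print(X.toarray())                         # [[1 1 0 1], [0 0 1 1]]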

If you just want to match non-alphanumeric tokens, you can use:

[^A-Za-z0-9]*
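
If the goal is to keep only tokens that contain no alphanumeric characters at all, a quick way to try the pattern outside of CountVectorizer might look like the sketch below (using "+" instead of "*" so the empty string is not matched; the token list is made up).

import re

# Filter whitespace-split tokens down to the purely non-alphanumeric ones.
tokens = ["hello", ":)", "c++", "--", "123"]
non_alnum = [t for t in tokens if re.fullmatch(r"[^A-Za-z0-9]+", t)]
print(non_alnum)  # [':)', '--']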
License: CC-BY-SA with attribution
Not affiliated with StackOverflow