Question

By using NLTK's WhitespaceTokenizer I am able to curate a vocabulary with non-alphanumeric terms, but at the transform step those terms are not counted: they are 0 in all the feature vectors. So even though I tokenized the docs with a simple whitespace split, I also need to change the token_pattern of the CountVectorizer. However, I am unable to figure out which regular expression pattern I should use. Any idea?

Was it helpful?

Solution

From your confusion, it seems that what you need is to learn RegEx (here).

If you want the pattern to keep every whitespace-delimited token, you can set the token_pattern parameter of CountVectorizer to:

\S+

CountVectorizer applies this pattern with re.findall, so \S+ reproduces a plain whitespace split (each token is a run of one or more non-whitespace characters). Note that a pattern like .* would instead match the whole document as a single token, and that token_pattern is ignored when you also pass a custom tokenizer.
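As a sketch of that fix (the documents and emoticon tokens below are made up for illustration, assuming scikit-learn's CountVectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents containing non-alphanumeric tokens (emoticons).
docs = ["good :) movie", "bad :( movie :("]

# \S+ keeps every whitespace-delimited chunk, so ":)" and ":(" survive,
# unlike the default pattern, which only keeps words of 2+ word characters.
vec = CountVectorizer(token_pattern=r"\S+", lowercase=False)
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # vocabulary now includes ':)' and ':('
print(X.toarray())              # per-document counts, emoticons included
```

With the default token_pattern, the same call would silently drop ":)" and ":(" and their columns would stay at 0, which is exactly the symptom described in the question.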

If you just want to match tokens made up entirely of non-alphanumeric characters, you can use:

[^A-Za-z0-9]+

(Use + rather than *: with * the pattern also produces empty matches.)
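To see how these two patterns behave, you can imitate what CountVectorizer does internally with the standard-library re.findall (the example string is made up):

```python
import re

text = "great movie :) 10/10"

# \S+ keeps every whitespace-delimited token, punctuation included.
print(re.findall(r"\S+", text))
# → ['great', 'movie', ':)', '10/10']

# [^A-Za-z0-9]+ extracts only runs of non-alphanumeric characters;
# note that whitespace counts as non-alphanumeric too.
print(re.findall(r"[^A-Za-z0-9]+", text))
# → [' ', ' :) ', '/']
```

Because whitespace matches [^A-Za-z0-9], you may want to exclude it explicitly (e.g. [^A-Za-z0-9\s]+) if you only care about punctuation-style tokens.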
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow