You're right that vocabulary
is what you want. It works like this:
>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)
So you pass it an iterable of your desired features (or a dict mapping each feature to a column index), and only those features are counted.
If you used CountVectorizer
on one set of documents and then you want to use the set of features from those documents for a new set, use the vocabulary_
attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do
newVec = CountVectorizer(vocabulary=vec.vocabulary_)
to create a new vectorizer that uses the vocabulary from your first one.
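Putting that together, a minimal sketch (the document names and the `vec` variable are illustrative, standing in for your original vectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on the original documents; `vec` stands in for your first vectorizer.
vec = CountVectorizer()
train_counts = vec.fit_transform(['pease porridge hot', 'pease porridge cold'])

# Reuse the learned vocabulary for a new document set, so the columns of
# both matrices line up feature-for-feature.
newVec = CountVectorizer(vocabulary=vec.vocabulary_)
new_counts = newVec.fit_transform(['pease porridge in the pot', 'nine days old'])

# Both matrices have one column per feature in the original vocabulary.
assert new_counts.shape[1] == train_counts.shape[1]
```

Words in the new documents that were not in the original vocabulary (like 'nine') are simply ignored, which is usually what you want when scoring new data against a trained model.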