Under the default settings, HashingVectorizer
normalizes your feature vectors to unit Euclidean length:
>> import numpy as np
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> text = "foo bar baz quux bla"
>>> X = HashingVectorizer(n_features=8).transform([text])
>>> X.toarray()
array([[-0.57735027, 0. , 0. , 0. , 0.57735027,
0. , -0.57735027, 0. ]])
>> np.linalg.norm(X.toarray())
1.0
Setting binary=True
only postpones this normalization until after the features have been binarized, i.e. after all non-zero values have been set to one. To turn normalization off entirely, you also have to set norm=None:
>>> X = HashingVectorizer(n_features=8, binary=True).transform([text])
>>> X.toarray()
array([[ 0.5, 0. , 0. , 0. , 0.5, 0.5, 0.5, 0. ]])
>> np.linalg.norm(X.toarray())
1.0
>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.toarray()
array([[ 1., 0., 0., 0., 1., 1., 1., 0.]])
This is also why the vectorizer returns float
arrays: normalization requires floats. The vectorizer could be rigged to return another dtype, but that would require a conversion inside the transform
method, and probably another back to float in the next estimator.
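If you really want integer output, you can do the conversion yourself after transforming. A minimal sketch (assuming the combination binary=True, norm=None shown above, so every value is an exact 0 or 1 and the cast is lossless):

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

text = "foo bar baz quux bla"
vectorizer = HashingVectorizer(n_features=8, binary=True, norm=None)
X = vectorizer.transform([text])

# With binary=True and norm=None, all stored values are exactly 1.0,
# so casting the sparse matrix to an integer dtype loses nothing.
X_int = X.astype(np.int64)
print(X_int.dtype)       # int64
print(X_int.toarray())   # 0/1 indicator row
```

Just remember that any downstream estimator that scales or normalizes its input will convert the data back to float anyway.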