Under the default settings, HashingVectorizer
normalizes your feature vectors to unit Euclidean length:
>> import numpy as np
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> text = "foo bar baz quux bla"
>>> X = HashingVectorizer(n_features=8).transform([text])
>>> X.toarray()
array([[-0.57735027, 0. , 0. , 0. , 0.57735027,
0. , -0.57735027, 0. ]])
>> np.linalg.norm(X.toarray())
1.0
Setting binary=True
only postpones this normalization until after the features have been binarized, i.e. after all non-zero values have been set to one. To turn normalization off entirely, you also have to set norm=None:
>>> X = HashingVectorizer(n_features=8, binary=True).transform([text])
>>> X.toarray()
array([[ 0.5, 0. , 0. , 0. , 0.5, 0.5, 0.5, 0. ]])
>> np.linalg.norm(X.toarray())
1.0
>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.toarray()
array([[ 1., 0., 0., 0., 1., 1., 1., 0.]])
This is also why the vectorizer returns float
arrays: normalization requires floats. The vectorizer could be rigged to return another dtype, but that would require a conversion inside the transform
method, and probably another back to float in the next estimator.
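If you really want integer output, you can do the conversion yourself after transforming. A minimal sketch (assuming the combination binary=True, norm=None shown above, so every value is an exact 0 or 1 and the cast is lossless):

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

text = "foo bar baz quux bla"
vectorizer = HashingVectorizer(n_features=8, binary=True, norm=None)
X = vectorizer.transform([text])

# With binary=True and norm=None, all stored values are exactly 1.0,
# so casting the sparse matrix to an integer dtype loses nothing.
X_int = X.astype(np.int64)
print(X_int.dtype)       # int64
print(X_int.toarray())   # 0/1 indicator row
```

Just remember that any downstream estimator that scales or normalizes its input will convert the data back to float anyway.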