Question

Field-aware factorization machines (FFM) have proved to be useful in click-through rate prediction tasks. One of their strengths comes from the hashing trick (feature hashing).

When one uses the hashing trick from scikit-learn, one ends up with a sparse matrix.

How can one then work with such a sparse matrix to still implement field-aware factorization machines? scikit-learn does not have an FFM implementation.

EDIT 1: I definitely want to perform feature hashing (the hashing trick) in order to scale FFM to millions of features.

EDIT 2: Pandas is not able to scale to many fields. I also want to convert an arbitrary CSV (containing numerical and categorical features) into LIBFFM (field:index:value) format and perform the hashing trick at the same time, preferably without using Pandas. Pandas2FFM does not scale even after performing the hashing trick.
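For reference, each line of a LIBFFM file is a label followed by space-separated field:index:value triples, e.g. (values are illustrative only):

1 0:384927:1 1:102748:1 2:771234:3.75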


Solution

One option is to use xLearn, a scikit-learn-compatible package for FFM, which handles that issue automatically.
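For example, training an FFM on data that is already in LIBFFM format follows xLearn's documented quick-start pattern (the file paths and hyperparameters below are placeholders):

import xlearn as xl

# Create an FFM model and point it at LIBFFM-formatted files
ffm_model = xl.create_ffm()
ffm_model.setTrain("./train.ffm")
ffm_model.setValidate("./valid.ffm")

# Binary classification with logistic loss; lr and lambda need tuning
param = {'task': 'binary', 'lr': 0.2, 'lambda': 0.002, 'metric': 'auc'}
ffm_model.fit(param, './model.out')

# Predict probabilities for a held-out test set
ffm_model.setTest("./test.ffm")
ffm_model.setSigmoid()
ffm_model.predict('./model.out', './predictions.txt')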

If you require feature hashing, you can write a custom feature hashing function:

import hashlib

def hash_str(string: str, n_bins: int) -> int:
    # Map a string to a bucket in [1, n_bins - 1]; index 0 is left unused
    # (handy if you want to reserve it for missing/unknown values).
    return int(hashlib.md5(string.encode('utf8')).hexdigest(), 16) % (n_bins - 1) + 1
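Building on that, here is a minimal sketch of the CSV-to-LIBFFM conversion asked about in EDIT 2 (the function name csv_to_ffm and its arguments are illustrative, not part of any library): it streams an arbitrary CSV into field:index:value lines, hashing feature indices on the fly, so no Pandas DataFrame or category dictionary is held in memory.

import csv

def csv_to_ffm(csv_path: str, ffm_path: str, label_col: str,
               categorical_cols: list, numerical_cols: list,
               n_bins: int = 2 ** 20) -> None:
    # Stream the CSV row by row so memory use stays flat regardless of file size.
    fields = categorical_cols + numerical_cols
    with open(csv_path, newline='') as fin, open(ffm_path, 'w') as fout:
        reader = csv.DictReader(fin)
        for row in reader:
            parts = [row[label_col]]
            for field_id, col in enumerate(fields):
                if col in categorical_cols:
                    # Categorical feature: hash "column=value", the value is 1
                    idx = hash_str(f'{col}={row[col]}', n_bins)
                    parts.append(f'{field_id}:{idx}:1')
                else:
                    # Numerical feature: hash the column name, keep the raw value
                    idx = hash_str(col, n_bins)
                    parts.append(f'{field_id}:{idx}:{row[col]}')
            fout.write(' '.join(parts) + '\n')

Here every column becomes one field and all fields share a single hash space of size n_bins; the field id in each triple keeps fields distinguishable even when hashed indices collide across columns.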

OTHER TIPS

I normally don't use sklearn for the encodings, but the category_encoders package.

Have you considered using their HashingEncoder?

The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.

The output is int64 features. The category_encoders API is easy to use and can be dropped into a transformer pipeline.
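A minimal sketch of how that might look, assuming a pandas DataFrame with one categorical column (the column name and n_components below are placeholders):

import pandas as pd
import category_encoders as ce

# Toy frame with a categorical column; unseen categories at scoring time
# are fine because hashing needs no fitted vocabulary.
X = pd.DataFrame({'device': ['mobile', 'desktop', 'tablet', 'mobile']})

encoder = ce.HashingEncoder(cols=['device'], n_components=8)
X_hashed = encoder.fit_transform(X)
print(X_hashed.head())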

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange