Question

Field-aware factorization machines (FFM) have proved to be useful in click-through rate prediction tasks. One of their strengths comes from the hashing trick (feature hashing).

When one uses the hashing trick from scikit-learn, one ends up with a sparse matrix.

How can one then work with such a sparse matrix to implement field-aware factorization machines? scikit-learn does not have an implementation of FFM.
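
For concreteness, a minimal sketch of what the scikit-learn hashing trick returns (the feature strings below are invented):

from sklearn.feature_extraction import FeatureHasher

# Each row is a list of "field=value" strings hashed into a fixed-width space.
hasher = FeatureHasher(n_features=2 ** 20, input_type="string")
X = hasher.transform([["ad_id=123", "site=news"], ["ad_id=456", "site=sports"]])
print(type(X), X.shape)  # scipy.sparse matrix of shape (2, 1048576)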

EDIT 1: I definitely want to perform feature hashing (the hashing trick) in order to scale FFM to millions of features.

EDIT 2: Pandas is not able to scale to many fields. I also want to convert an arbitrary CSV (containing numerical and categorical features) into the LIBFFM (field:index:value) format and perform the hashing trick at the same time (preferably without using Pandas). Pandas2FFM does not scale even after performing the hashing trick.

Solution

One option is to use xLearn, a scikit-learn-compatible package that implements FFM and handles the sparse-input issue automatically.
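
If you go that route, a minimal sketch using xLearn's native Python API, which trains an FFM directly from files in the LIBFFM format (the file names below are placeholders):

import xlearn as xl

ffm_model = xl.create_ffm()                # field-aware factorization machine
ffm_model.setTrain("train.ffm")            # LIBFFM-format file: label field:index:value ...
param = {"task": "binary", "lr": 0.2, "lambda": 0.002, "metric": "auc"}
ffm_model.fit(param, "model.out")          # train and save the model weights
ffm_model.setTest("test.ffm")
ffm_model.setSigmoid()                     # squash raw scores into [0, 1]
ffm_model.predict("model.out", "predictions.txt")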

If you require feature hashing, you can write a custom feature hashing function:

import hashlib

def hash_str(string: str, n_bins: int) -> int:
    # Map a string to a stable bucket in [1, n_bins - 1]; index 0 is kept unused.
    return int(hashlib.md5(string.encode('utf8')).hexdigest(), 16) % (n_bins - 1) + 1
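
Building on that helper, here is a hedged sketch that streams an arbitrary CSV (numerical and categorical columns) into the LIBFFM field:index:value format without Pandas; the file names and the assumption of a "label" column are mine:

import csv

def row_to_ffm(label: str, row: dict, field_ids: dict, n_bins: int) -> str:
    # Build one LIBFFM line: "label field:index:value ...".
    tokens = [label]
    for name, raw in row.items():
        field = field_ids[name]  # stable integer id per CSV column (the "field")
        try:
            value = float(raw)                         # numerical: keep the value,
            index = hash_str(name, n_bins)             # hash only the column name
        except ValueError:
            value = 1.0                                # categorical: weight 1,
            index = hash_str(f"{name}={raw}", n_bins)  # hash "column=value"
        tokens.append(f"{field}:{index}:{value:g}")
    return " ".join(tokens)

with open("data.csv", newline="") as src, open("data.ffm", "w") as dst:
    reader = csv.DictReader(src)
    feature_cols = [n for n in reader.fieldnames if n != "label"]
    field_ids = {name: i for i, name in enumerate(feature_cols)}
    for row in reader:
        label = row.pop("label")  # assumes the target column is named "label"
        dst.write(row_to_ffm(label, row, field_ids, n_bins=2 ** 20) + "\n")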

Other tips

I normally don't use scikit-learn for the encodings but the category_encoders package.

Have you considered using their HashingEncoder?

The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.

The output is int64 features. The category_encoders API is easy to use and can be dropped into a scikit-learn transformer pipeline.
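
A minimal sketch of that encoder (note it operates on pandas DataFrames; the toy columns are invented):

import pandas as pd
import category_encoders as ce

# HashingEncoder hashes each category into a fixed number of buckets,
# so no dictionary of seen categories is kept.
X = pd.DataFrame({"city": ["paris", "tokyo", "paris"], "device": ["ios", "android", "web"]})
encoder = ce.HashingEncoder(cols=["city", "device"], n_components=16)
X_encoded = encoder.fit_transform(X)  # 16 int64 hash-bucket columns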

Licensed under: CC-BY-SA with attribution