I have a dataset where each sample is an ordered list of items (say, a grocery list) and a label from one of 6 categories. Each list can have up to 120 items, but the mean list length is 12 items.

I would like to embed each sample in a way similar to word2vec, and perhaps average the vectors in each list to get one vector per sample.

Do you think word2vec is the right approach here? The "texts" are very short, and the position of each item is not contextual but reflects something more like the quantity purchased.


Solution

So, you want to make embeddings for sequences of items where order doesn't carry much meaning. You don't have a specific objective; you just want embeddings with some desirable properties, such as natural clustering.

You can train word2vec and take the mean of the item embeddings in a cart, but you'll probably get noisy vectors; in my experience it doesn't work as well as expected. I suggest you try dimensionality-reduction methods such as NMF instead. Fit it on a sparse binary matrix of shape (num_samples, num_items), and you'll get transformed vectors of the desired dimensionality.
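A minimal sketch of the NMF approach, assuming scikit-learn and SciPy are available; the toy carts and item IDs here are made up for illustration:

```python
# Embed item lists via NMF on a sparse binary item-occurrence matrix.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Hypothetical carts, each a list of item IDs from a catalogue of 10 items.
samples = [[0, 3, 7], [1, 3, 4], [2, 5, 7, 8], [0, 7, 9], [1, 4, 6]]
num_items = 10

# Build the (num_samples, num_items) binary matrix: 1 if the item is in the cart.
rows = [i for i, cart in enumerate(samples) for _ in cart]
cols = [item for cart in samples for item in cart]
X = csr_matrix((np.ones(len(cols)), (rows, cols)),
               shape=(len(samples), num_items))

# Factorise into nonnegative components; each row of `embeddings` is one cart.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
embeddings = nmf.fit_transform(X)  # shape (num_samples, n_components)
print(embeddings.shape)
```

With real data you would pick `n_components` (the embedding dimensionality) by evaluating downstream, and duplicate items in a cart could instead be kept as counts rather than binarised.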

Then you might want to evaluate the embeddings somehow. If you don't have any quality metric, you can either run clustering to see how the embeddings form groups, or make nearest-neighbour search queries by hand.
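Both sanity checks can be sketched as follows; the embeddings here are synthetic stand-ins for the vectors you'd get from NMF:

```python
# Sanity-check embeddings via clustering and nearest-neighbour queries.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two well-separated synthetic groups standing in for real cart embeddings.
embeddings = np.vstack([rng.normal(0.0, 0.1, (20, 8)),
                        rng.normal(1.0, 0.1, (20, 8))])

# Clustering: do the embeddings fall into coherent groups?
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# Nearest-neighbour queries: are a sample's neighbours plausible by eye?
nn = NearestNeighbors(n_neighbors=3).fit(embeddings)
_, idx = nn.kneighbors(embeddings[:1])  # neighbours of the first sample
print(labels[:5], idx[0])
```

On real carts you would inspect which items dominate each cluster and whether a cart's nearest neighbours contain similar groceries.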

Good luck!
