SPARK 1.5.1: Convert multi-labeled data into binary vector

https://datascience.stackexchange.com/questions/8717

16-10-2019

Question

I am using Spark 1.5.1, and I have a DataFrame that looks as follows:

labelsCol, featureCol
(Label1, Label2, Label32), FeatureVector
(Label1, Label10, Label16, Label30, Label48), FeatureVector
...
(Label1, Label95), FeatureVector

The first column is the list of labels for that sample, and in total I have 100 labels.

I would like to build a binary classifier for each label, so I want to transform the labels list column into a binary vector.

The binary vector will have a length of 100, and each value will be 0 or 1 depending on whether the corresponding label is present for the sample.

Is there a straightforward solution for this?

Solution

Spark only recently implemented CountVectorizer, which takes the labels (as an array of strings) and encodes them as a 100-dimensional count vector, assuming all 100 labels show up somewhere in your dataset. Once you have those vectors, it should be a simple step to threshold them so that each entry is 0 or 1 instead of a frequency.
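
To make the two steps concrete, here is a minimal plain-Python sketch of the count-then-threshold logic. It is not Spark code and does not use the CountVectorizer API; the helper names (fit_vocabulary, to_count_vector, to_binary_vector) and the sample rows are hypothetical, and the alphabetical index order is an assumption of this sketch, so the ordering Spark chooses for its vocabulary may differ.

```python
from collections import Counter


def fit_vocabulary(label_lists):
    """Collect every distinct label and give it a fixed vector index.

    Assumption of this sketch: indices follow alphabetical order.
    """
    vocab = sorted({label for labels in label_lists for label in labels})
    return {label: i for i, label in enumerate(vocab)}


def to_count_vector(labels, index):
    """Return a frequency vector: how often each known label occurs."""
    vec = [0] * len(index)
    for label, n in Counter(labels).items():
        if label in index:  # labels never seen during fitting are ignored
            vec[index[label]] = n
    return vec


def to_binary_vector(labels, index):
    """Threshold the counts so every entry is 0 or 1."""
    return [1 if count > 0 else 0 for count in to_count_vector(labels, index)]


# Hypothetical sample rows modeled on the question's labels column.
rows = [
    ["Label1", "Label2", "Label32"],
    ["Label1", "Label10", "Label16", "Label30", "Label48"],
    ["Label1", "Label95"],
]

index = fit_vocabulary(rows)
binary_rows = [to_binary_vector(labels, index) for labels in rows]
```

With only these three rows the vectors have length 8, not 100, which illustrates the caveat above: the vector length equals the number of distinct labels actually seen. If each label can appear at most once per sample, the counts are already 0 or 1 and the thresholding step changes nothing.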

Licensed under: CC-BY-SA with attribution
