How to convert vector values to fit k-means algorithm function?

https://datascience.stackexchange.com/questions/6830

16-10-2019

Question

I have a set of user objects that I want to group with a $k$-means function, based on their quiz answers. Each quiz question has predefined answers with letter values "a", "b", "c", "d". If a user answers question #1 with letter "b", I store this answer as the vector $(0, 1, 0, 0)$. The $k$-means function I have to use takes a two-dimensional array of numbers as input (in this case array[user][question]), and I can't figure out how to use it, because each user's answer to a question is a vector rather than a single number. How can I convert my vector values to numbers so that I can use my $k$-means function?

Solution

You are 95% there; you just have one hangup.

The vectorization that you are doing is alternatively known as binarization or one-hot encoding. The only thing you need to do now is break apart all of those vectors and think of them as individual features.

So instead of thinking of the question one vector as $(0,0,1,0)$ and the question two vector as $(0,1,0,0)$, you can treat each component of those vectors as its own feature.

So this:

-      q1,        q2
-      (a,b,c,d), (a,b,c,d)
user1  (0,0,1,0), (0,1,0,0)
user2  (1,0,0,0), (0,0,0,1)

Becomes this:

-      q1a q1b q1c q1d q2a q2b q2c q2d
user1  0   0   1   0   0   1   0   0
user2  1   0   0   0   0   0   0   1

And you can think of each one of those binary features as an orthogonal dimension of your data, which now lies in an 8-dimensional space.
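The flattening step can be sketched in a few lines of Python. This is only an illustration: the `answers` dictionary and the helper names `flatten` and `squared_distance` are made up for the example, and the data is the two-user table from above.

```python
# Hypothetical data: each user's answers as one-hot vectors, one per question.
answers = {
    "user1": [(0, 0, 1, 0), (0, 1, 0, 0)],
    "user2": [(1, 0, 0, 0), (0, 0, 0, 1)],
}

def flatten(one_hot_vectors):
    """Concatenate per-question one-hot vectors into one flat feature row."""
    return [bit for vector in one_hot_vectors for bit in vector]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

user_ids = list(answers)
# features[i][j] is now a plain number, so features is a 2-D array.
features = [flatten(answers[u]) for u in user_ids]

for user_id, row in zip(user_ids, features):
    print(user_id, row)

# Each question answered differently contributes 2 to the squared distance.
print(squared_distance(features[0], features[1]))
```

With this encoding, the squared Euclidean distance between two users is twice the number of questions they answered differently, which is the quantity a distance-based $k$-means function would end up working with.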

Hope this helps!

OTHER TIPS

A two-dimensional array is a list of vectors, so

{{userid1,1a,1b,1c,1d,2a,2b,2c,2d,...,na,nb,nc,nd},
{userid2,1a,1b,1c,1d,2a,2b,2c,2d,...,na,nb,nc,nd},
...,
{useridk,1a,1b,1c,1d,2a,2b,2c,2d,...,na,nb,nc,nd}}

would be a suitable input for a test with n questions and k contestants, where 1a is 1 if the user chose response a for question one and 0 otherwise.
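A minimal sketch of building such an array directly from letter answers, assuming the four choices "a" to "d" from the question. The names `CHOICES`, `encode_user`, and `raw` are hypothetical, and the sample data is invented for illustration.

```python
CHOICES = "abcd"  # the predefined answer letters from the question

def encode_user(letters):
    """Turn one user's answers, e.g. ["c", "b"], into a flat 0/1 row."""
    row = []
    for letter in letters:
        one_hot = [0] * len(CHOICES)
        one_hot[CHOICES.index(letter)] = 1  # mark the chosen answer
        row.extend(one_hot)
    return row

# Hypothetical raw data: user id -> one letter per question.
raw = {"user1": ["c", "b"], "user2": ["a", "d"]}

# Assumption: ids are kept in a parallel list rather than as a column.
user_ids = list(raw)
rows = [encode_user(raw[u]) for u in user_ids]

for user_id, row in zip(user_ids, rows):
    print(user_id, row)
```

This sketch keeps the user ids in a separate list with the same ordering as `rows`, on the assumption that the $k$-means function treats every column as a numeric feature. An id is a label rather than a quiz response, so whether it belongs in the array depends on the function being used.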

Licensed under: CC-BY-SA with attribution
