How to find similarity between different factors in a dataset

https://datascience.stackexchange.com/questions/6232

16-10-2019
|

Question

Introduction

Let's say I have a dataset of different observation of different people and I want to group people together to know which person is closest to the other one. I also want to have a measure to know how close they are to each others and know the statistical significance.

Data

       eat_rate drink_rate   sleep_rate    play_rate  name   game
1  0.0542192259 0.13041721 5.013682e-03 1.023533e-06  Paul Rayman
4  0.0688171511 0.01050611 6.178833e-03 3.238838e-07  Paul  Mario
6  0.0928997660 0.01828468 9.321211e-03 3.525951e-07  Jenn  Mario
7  0.0001631273 0.02212345 7.061524e-05 1.531270e-07  Jean   FIFA
8  0.0028735509 0.05414688 1.341689e-03 4.533366e-07  Mark   FIFA
10 0.0034844717 0.09152440 4.589990e-04 5.802708e-07  Mark Rayman
11 0.0340738956 0.03384180 1.636508e-02 1.354973e-07  Mark   FIFA
12 0.0266112679 0.20002020 3.380704e-02 4.533366e-07  Mark  Sonic
14 0.0046597056 0.01848672 5.472681e-04 4.034696e-07  Paul   FIFA
15 0.0202715299 0.16365289 2.994086e-02 4.044770e-07 Lucas   SSBM

Reproduce it:

structure(list(eat_rate = c(0.0542192259374624, 0.0688171511010916, 
0.0928997659570807, 0.000163127341146237, 0.00287355085557602, 
0.00348447171120939, 0.0340738956099744, 0.0266112679045701, 
0.00465970561072008, 0.0202715299408583), drink_rate = c(0.130417213859986, 
0.0105061117284574, 0.0182846752197192, 0.0221234468128094, 0.0541468835235882, 
0.0915243964036772, 0.0338418022022427, 0.200020204061016, 0.0184867158298818, 
0.163652894231741), sleep_rate = c(0.00501368170182717, 0.00617883308323771, 
0.00932121105128431, 7.06152352370024e-05, 0.00134168946950305, 
0.000458999029040516, 0.0163650807661753, 0.0338070438697149, 
0.000547268073086768, 0.029940859740489), play_rate = c(1.02353325645595e-06, 
3.23883801132467e-07, 3.52595117873603e-07, 1.53127022619393e-07, 
4.53336580123204e-07, 5.80270822557701e-07, 1.35497266725713e-07, 
4.53336580123204e-07, 4.03469556309652e-07, 4.04476970932148e-07
), name = structure(c(5L, 5L, 2L, 1L, 4L, 4L, 4L, 4L, 5L, 3L), .Label = c("Jean", 
"Jenn", "Lucas", "Mark", "Paul"), class = "factor"), game = structure(c(3L, 
2L, 2L, 1L, 1L, 3L, 1L, 4L, 1L, 5L), .Label = c("FIFA", "Mario", 
"Rayman", "Sonic", "SSBM"), class = "factor")), .Names = c("eat_rate", 
"drink_rate", "sleep_rate", "play_rate", "name", "game"), row.names = c(1L, 
4L, 6L, 7L, 8L, 10L, 11L, 12L, 14L, 15L), class = "data.frame")

Question

Given a dataset as fellow (with continuous and categorical feature), how can I know if a person (a categorical answer) identified by a name is more correlated to another person?

Solution

One way is to normalize your quantitative values (play, eat, drink, sleep rates) so they all have the same range (say, 0 -> 1), then assign each game to its own "dimension", that takes value 0 or 1. Turn each row into a vector and normalize the length to 1. Now, you can compare the inner product of any two people's normalized vectors as a measure of similarity. Something like this is used in text mining quite often

R Code for Similarity Matrix

Assumes you've saved your dataframe to the variable "D"

#Get normalization factors for quantitative measures
maxvect<-apply(D[,1:4],MARGIN=2,FUN=max)
minvect<-apply(D[,1:4],MARGIN=2,FUN=min)
rangevect<-maxvect-minvect
#Normalize quantative factors
D_matrix <- as.matrix(D[,1:4])
NormDMatrix<-matrix(nrow=10,ncol=4)
colnames(NormDMatrix)<-colnames(D_matrix)
for (i in 1:4) NormDMatrix[,i]<-(D_matrix[,i]-minvect[i]*rep(1,10))/rangevect[i]
gamenames<-unique(D[,"game"])
#Create dimension matrix for games
Ngames<-length(gamenames)
GameMatrix<-matrix(nrow=10,ncol=Ngames)
for (i in 1:Ngames) GameMatrix[,i]<-as.numeric(D[,"game"]==gamenames[i])
colnames(GameMatrix)<-gamenames
#combine game matrix with normalized quantative matrix
People<-D[,"name"]
RowVectors<-cbind(GameMatrix,NormDMatrix)
#normalize each row vector to length of 1 and then store as a data frame with person names
NormRowVectors<-t(apply(RowVectors,MARGIN=1,FUN=function(x) x/sqrt(sum(x*x))))
dfNorm<-data.frame(People,NormRowVectors)

#create person vectors via addition of appropriate row vectors
PersonMatrix<-array(dim=c(length(unique(People)),ncol(RowVectors)))
rownames(PersonMatrix)<-unique(People)
for (p in unique(People)){
  print(p)
  MatchIndex<-(dfNorm[,1]==p)*seq(1,nrow(NormRowVectors))
  MatchIndex<-MatchIndex[MatchIndex>0]
  nclm<-length(MatchIndex)
  SubMatrix<-matrix(NormRowVectors[MatchIndex,],nrow=length(MatchIndex),ncol=dim(NormRowVectors)[2])
  CSUMS<-colSums(SubMatrix)
  NormSum<-sqrt(sum(CSUMS*CSUMS))
  PersonMatrix[p,]<-CSUMS/NormSum
}
colnames(PersonMatrix)<-colnames(NormRowVectors)
#Calculate matrix of dot products
Similarity<-(PersonMatrix)%*%t(PersonMatrix)

OTHER TIPS

Despite normalized euclidean distance you can also have a look at the pearson distance as a similarity measure. Here is a neat description : http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/pear.html

You might want to normalize all the continuous variables into one range (0-1)
Normalize the categorical variables as a One Hot Enconder
Apply Similarity algorithms like Pearson Correlation / Distance algorithms like (Euclidean, Cosine similarity)

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange