Question

Introduction

Let's say I have a dataset of several observations of different people, and I want to group the people together to know which person is closest to which other. I also want a measure of how close they are to each other, and to know the statistical significance of that closeness.

Data

       eat_rate drink_rate   sleep_rate    play_rate  name   game
1  0.0542192259 0.13041721 5.013682e-03 1.023533e-06  Paul Rayman
4  0.0688171511 0.01050611 6.178833e-03 3.238838e-07  Paul  Mario
6  0.0928997660 0.01828468 9.321211e-03 3.525951e-07  Jenn  Mario
7  0.0001631273 0.02212345 7.061524e-05 1.531270e-07  Jean   FIFA
8  0.0028735509 0.05414688 1.341689e-03 4.533366e-07  Mark   FIFA
10 0.0034844717 0.09152440 4.589990e-04 5.802708e-07  Mark Rayman
11 0.0340738956 0.03384180 1.636508e-02 1.354973e-07  Mark   FIFA
12 0.0266112679 0.20002020 3.380704e-02 4.533366e-07  Mark  Sonic
14 0.0046597056 0.01848672 5.472681e-04 4.034696e-07  Paul   FIFA
15 0.0202715299 0.16365289 2.994086e-02 4.044770e-07 Lucas   SSBM

Reproduce it:

structure(list(eat_rate = c(0.0542192259374624, 0.0688171511010916, 
0.0928997659570807, 0.000163127341146237, 0.00287355085557602, 
0.00348447171120939, 0.0340738956099744, 0.0266112679045701, 
0.00465970561072008, 0.0202715299408583), drink_rate = c(0.130417213859986, 
0.0105061117284574, 0.0182846752197192, 0.0221234468128094, 0.0541468835235882, 
0.0915243964036772, 0.0338418022022427, 0.200020204061016, 0.0184867158298818, 
0.163652894231741), sleep_rate = c(0.00501368170182717, 0.00617883308323771, 
0.00932121105128431, 7.06152352370024e-05, 0.00134168946950305, 
0.000458999029040516, 0.0163650807661753, 0.0338070438697149, 
0.000547268073086768, 0.029940859740489), play_rate = c(1.02353325645595e-06, 
3.23883801132467e-07, 3.52595117873603e-07, 1.53127022619393e-07, 
4.53336580123204e-07, 5.80270822557701e-07, 1.35497266725713e-07, 
4.53336580123204e-07, 4.03469556309652e-07, 4.04476970932148e-07
), name = structure(c(5L, 5L, 2L, 1L, 4L, 4L, 4L, 4L, 5L, 3L), .Label = c("Jean", 
"Jenn", "Lucas", "Mark", "Paul"), class = "factor"), game = structure(c(3L, 
2L, 2L, 1L, 1L, 3L, 1L, 4L, 1L, 5L), .Label = c("FIFA", "Mario", 
"Rayman", "Sonic", "SSBM"), class = "factor")), .Names = c("eat_rate", 
"drink_rate", "sleep_rate", "play_rate", "name", "game"), row.names = c(1L, 
4L, 6L, 7L, 8L, 10L, 11L, 12L, 14L, 15L), class = "data.frame")

Question

Given a dataset like the one above (with continuous and categorical features), how can I tell whether a person (a categorical value) identified by a name is more correlated with one other person than with another?


Solution

One way is to normalize your quantitative values (play, eat, drink, sleep rates) so they all have the same range (say, 0 to 1), then assign each game its own "dimension" that takes the value 0 or 1. Turn each row into a vector, normalize its length to 1, and then use the inner product of any two people's normalized vectors as a measure of their similarity. Something like this is used in text mining quite often.
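For intuition, here is a minimal sketch of that inner-product idea on two made-up, already combined row vectors (the numbers are purely illustrative):

# Two toy row vectors: four scaled rates followed by one-hot game columns (values invented)
a <- c(0.20, 0.50, 0.10, 1, 0)
b <- c(0.30, 0.40, 0.20, 1, 0)
# Normalize each to unit length, then take the inner product (cosine similarity)
a_n <- a / sqrt(sum(a * a))
b_n <- b / sqrt(sum(b * b))
sum(a_n * b_n)  # near 1 = very similar, near 0 = dissimilar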


R Code for Similarity Matrix

Assumes you've saved your dataframe to the variable "D"

#Get normalization factors for quantitative measures
maxvect<-apply(D[,1:4],MARGIN=2,FUN=max)
minvect<-apply(D[,1:4],MARGIN=2,FUN=min)
rangevect<-maxvect-minvect
#Normalize quantitative factors
D_matrix <- as.matrix(D[,1:4])
NormDMatrix<-matrix(nrow=10,ncol=4)
colnames(NormDMatrix)<-colnames(D_matrix)
for (i in 1:4) NormDMatrix[,i]<-(D_matrix[,i]-minvect[i]*rep(1,10))/rangevect[i]
gamenames<-unique(D[,"game"])
#Create dimension matrix for games
Ngames<-length(gamenames)
GameMatrix<-matrix(nrow=10,ncol=Ngames)
for (i in 1:Ngames) GameMatrix[,i]<-as.numeric(D[,"game"]==gamenames[i])
colnames(GameMatrix)<-gamenames
#Combine the game matrix with the normalized quantitative matrix
People<-D[,"name"]
RowVectors<-cbind(GameMatrix,NormDMatrix)
#normalize each row vector to length of 1 and then store as a data frame with person names
NormRowVectors<-t(apply(RowVectors,MARGIN=1,FUN=function(x) x/sqrt(sum(x*x))))
dfNorm<-data.frame(People,NormRowVectors)

#create person vectors via addition of appropriate row vectors
PersonMatrix<-array(dim=c(length(unique(People)),ncol(RowVectors)))
rownames(PersonMatrix)<-unique(People)
for (p in unique(People)){
  print(p)
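  # Row indices belonging to person p (equivalent to which(dfNorm[,1] == p))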
  MatchIndex<-(dfNorm[,1]==p)*seq(1,nrow(NormRowVectors))
  MatchIndex<-MatchIndex[MatchIndex>0]
  nclm<-length(MatchIndex)
  SubMatrix<-matrix(NormRowVectors[MatchIndex,],nrow=length(MatchIndex),ncol=dim(NormRowVectors)[2])
  CSUMS<-colSums(SubMatrix)
  NormSum<-sqrt(sum(CSUMS*CSUMS))
  PersonMatrix[p,]<-CSUMS/NormSum
}
colnames(PersonMatrix)<-colnames(NormRowVectors)
#Calculate matrix of dot products
Similarity<-(PersonMatrix)%*%t(PersonMatrix)
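To read off who is closest to whom from the resulting matrix, a small sketch like the following works; it just masks the diagonal and picks the largest remaining entry in each row:

# For each person, report the most similar other person
diag(Similarity) <- NA   # ignore self-similarity
closest <- apply(Similarity, 1, function(x) names(which.max(x)))
data.frame(person = rownames(Similarity),
           closest = closest,
           similarity = apply(Similarity, 1, max, na.rm = TRUE))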

OTHER TIPS

Besides normalized Euclidean distance, you can also have a look at the Pearson distance as a similarity measure. Here is a neat description: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/pear.html
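For example, a Pearson-based comparison of the person profiles built above could be sketched like this (it reuses PersonMatrix from the code in the solution):

# Pearson correlation between person profiles (rows of PersonMatrix)
PearsonSim <- cor(t(PersonMatrix), method = "pearson")
# Turn it into a distance if needed: 0 = identical profile, 2 = opposite
PearsonDist <- 1 - PearsonSim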

  • You might want to normalize all the continuous variables into one range (0-1)
  • Encode the categorical variables with a one-hot encoder
  • Apply similarity measures such as Pearson correlation, or distance measures such as Euclidean distance and cosine similarity (see the sketch below)
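A compact base-R sketch of those three steps (range scaling, one-hot encoding via model.matrix, then a few pairwise comparisons), applied to the data frame D from the question, might look like this:

# 1. Scale the continuous rates to the 0-1 range
num  <- as.matrix(D[, 1:4])
mins <- apply(num, 2, min)
rng  <- apply(num, 2, max) - mins
scaled <- sweep(sweep(num, 2, mins), 2, rng, "/")
# 2. One-hot encode the categorical game column
onehot <- model.matrix(~ game - 1, data = D)
# 3. Combine and compare observations
X <- cbind(scaled, onehot)
euclid  <- as.matrix(dist(X))                # Euclidean distance between observations
norms   <- sqrt(rowSums(X^2))
cosine  <- (X %*% t(X)) / (norms %o% norms)  # cosine similarity between observations
pearson <- cor(t(X))                         # Pearson correlation between observations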