Question

I'm new to R and data analysis. I'm trying to build a simple custom recommendation system for a website. As input I have the user/session ID, item ID, and item price for each item a user clicked on.

c165c2ee-81cf-48cf-ba3f-83b70204c00c    161785  124.0
a886fdd5-7cee-4152-b1b7-77a2702687b0    643339  42.0
5e5fd670-b104-445b-a36d-b3798cd43279    131332  38.0
888d736f-99bc-49ca-969d-057e7d4bb8d1    1032763 39.0

I would like to apply cluster analysis to that data.

If I try to apply k-means clustering to my data:

> q <- kmeans(dat, centers=25)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(dat, centers = 25) : NAs introduced by coercion

If I try to apply hierarchical clustering to the data:

> m <- as.matrix(dat)
> d <- dist(m)   # find distance matrix
Warning message:
In dist(m) : NAs introduced by coercion

The "NAs introduced by coercion" warning seems to occur because the first column is not numeric. So I tried running the code against dat[-1], but the result is the same.

What am I missing or doing wrong?

Thanks a lot in advance.

=== UPDATE #1 ===

Output on str and factor:

> str(dat)
'data.frame':   14634 obs. of  3 variables:
 $ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
 $ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
 $ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...

> dat[,1] = factor(dat[,1])
> str(dat)
'data.frame':   14634 obs. of  3 variables:
 $ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
 $ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
 $ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...

> dd <- dist(dat)
Warning message:
In dist(dat) : NAs introduced by coercion
> hc <- hclust(dd)                # apply hierarchical clustering
Error in hclust(dd) : NA/NaN/Inf in foreign function call (arg 11)

=== UPDATE #2 ===

I would rather not remove the first column, since there can be multiple clicks for the same user, which I consider important for the analysis.

Solution

It sounds like you want to retain the first column (even though 10062 levels for 14634 observations is quite high). One way to convert a factor to numeric values is the model.matrix function, which expands a factor into one 0/1 indicator column per level. Before converting the factor:

data(iris)
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

After model.matrix:

head(model.matrix(~.+0, data=iris))
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1          5.1         3.5          1.4         0.2             1                 0                0
# 2          4.9         3.0          1.4         0.2             1                 0                0
# 3          4.7         3.2          1.3         0.2             1                 0                0
# 4          4.6         3.1          1.5         0.2             1                 0                0
# 5          5.0         3.6          1.4         0.2             1                 0                0
# 6          5.4         3.9          1.7         0.4             1                 0                0

As you can see, it expands out your factor values. So you could then run k-means clustering on the expanded version of your data:

kmeans(model.matrix(~.+0, data=iris), centers=3)
# K-means clustering with 3 clusters of sizes 49, 50, 51
# 
# Cluster means:
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1     6.622449    2.983673     5.573469    2.032653             0         0.0000000       1.00000000
# 2     5.006000    3.428000     1.462000    0.246000             1         0.0000000       0.00000000
# 3     5.915686    2.764706     4.264706    1.333333             0         0.9803922       0.01960784
# ...
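The str(dat) output in the question shows that the item ID (V8) and price (V12) columns are factors too, which may be why dropping only the first column did not remove the coercion warning. The following is a minimal sketch of turning such a factor back into numbers. The data frame, its column names, and the "n/a" entry are made up for illustration and are not the asker's data.

```r
# Toy data frame mimicking the question's layout: every column is a factor.
# Names and values are hypothetical.
dat <- data.frame(
  session = factor(c("s1", "s2", "s1", "s3")),
  item    = factor(c("161785", "643339", "131332", "161785")),
  price   = factor(c("124.0", "42.0", "n/a", "124.0"))
)

# as.numeric() applied directly to a factor returns the internal level codes,
# not the values, so convert to character first. Entries that are not valid
# numbers become NA (the warning is suppressed here on purpose).
dat$price <- suppressWarnings(as.numeric(as.character(dat$price)))

# Drop rows whose price could not be parsed before any distance computation.
dat <- dat[!is.na(dat$price), ]

str(dat$price)
```

Whether dropping unparsable rows is appropriate depends on the data; inspecting them first is usually worthwhile.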

OTHER TIPS

Try dat[,1] = factor(dat[,1]). I think the NAs come from the session ID (first column), which is not a number. Converting it to a factor indexes the session IDs.
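To illustrate what "indexed" means here: a factor stores each distinct ID once as a level and represents every row by an integer code. The sketch below uses made-up, shortened IDs. Note that these codes are arbitrary labels, so numeric distances between them carry no meaning.

```r
# Made-up, shortened session IDs; the second and fourth are the same session.
ids <- c("c165", "a886", "5e5f", "a886")

f <- factor(ids)
levels(f)                 # the distinct IDs

codes <- as.integer(f)    # integer code of each row's ID
codes                     # equal IDs share the same code
```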

k-means only works for continuous data.

You have two id columns that must not be used for clustering; they will make your result meaningless.

But even then I doubt that k-means is the appropriate algorithm for your problem. You first need to understand your data, then preprocess and transform it into an appropriate representation.

Don't expect a push-button solution; those either don't exist or don't work.
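As one possible illustration of "transform it into an appropriate representation", the sketch below summarises a click log into one row of numeric features per session (number of clicks and mean clicked price) and clusters those instead of the raw IDs. The data, the choice of features, and centers = 2 are all assumptions made for the example, not a recommendation for the asker's data.

```r
# Made-up click log: one row per click.
clicks <- data.frame(
  session = c("s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5", "s5"),
  price   = c(124, 42, 38, 124, 39, 42, 39, 120, 130),
  stringsAsFactors = FALSE
)

# Per-session summaries: repeated clicks become a feature rather than
# repeated rows.
n_clicks   <- tapply(clicks$price, clicks$session, length)
mean_price <- tapply(clicks$price, clicks$session, mean)

features <- data.frame(
  n_clicks   = as.vector(n_clicks),
  mean_price = as.vector(mean_price),
  row.names  = names(n_clicks)
)

# Standardise the columns so that neither dominates the Euclidean distance.
scaled <- scale(features)

set.seed(1)                          # k-means starts from random centers
fit <- kmeans(scaled, centers = 2)
fit$cluster                          # cluster label per session
```

The session ID is kept only as a row name, so it identifies each row without entering the distance computation.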

Don't use the Species column:

km <- kmeans(iris[,1:4], 3)
km
# K-means clustering with 3 clusters of sizes 50, 38, 62
#
# Cluster means:
#   Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     5.006000    3.428000     1.462000    0.246000
# 2     6.850000    3.073684     5.742105    2.071053
# 3     5.901613    2.748387     4.393548    1.433871
#
# Clustering vector:
#   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3
#  [59] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 3 2
# [117] 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3
#
# Within cluster sum of squares by cluster:
# [1] 15.15100 23.87947 39.82097
#  (between_SS / total_SS =  88.4 %)
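Because k-means starts from random centers, cluster numbering and sizes can vary between runs. A small sketch of making the run repeatable and comparing the clusters with the Species column that was left out (the seed value and nstart = 20 are arbitrary choices for the example):

```r
set.seed(42)                                   # fix the random starting centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Cross-tabulate cluster labels against the species that was held out.
tab <- table(cluster = km$cluster, species = iris$Species)
tab
```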

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow