Question

I am a beginner in data mining and want to cluster my movie data set to find groups of genres. I have 26 different genres for 86 movies in my data set. I would like to use clustering to group my movies into a few genres instead of 26. For example, after running some clustering algorithm, I would be left with 4 clusters, or whatever small count best suits my data set. I have defined my data set as follows: M1 {G1, G2, ..., G26}, M2 {G1, G2, ..., G26}, where each of the genres G1, ..., G26 holds either 0 or 1: 0 for absent, 1 for present. My next step is to run k-means clustering on that, and I want to use a good distance function, e.g. the Pearson correlation coefficient.

I am using MATLAB for my experiments. I tried k-means with k = 3, 4, 5, 6, and I also ran hierarchical clustering.

I am unsure how to determine which clustering results are better. How can I check that? As I am a beginner, I don't know how to plot clusters for binary features in MATLAB. I also don't know how to use the Pearson correlation coefficient as a distance metric in k-means. Please help.

Solution

Evaluation is the hardest part of clustering.

If you knew what you were looking for, you would not need to run cluster analysis.

So there is no such thing as an objective "truth" for clustering. What you consider a cluster depends a lot on what your personal needs are, and unless you encode them into a custom algorithm, chances are that the clustering algorithm computes something entirely different.

k-means, for example, minimizes the within-cluster variance, whether or not variance agrees with your idea of a cluster!
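To make the variance objective concrete, here is a minimal sketch of the quantity k-means tries to make small: the sum of squared distances from each point to its cluster mean. It is written in plain Python rather than MATLAB purely for illustration; the function name `within_cluster_ss` and the toy 0/1 data are assumptions, not part of the question's data set.

```python
def within_cluster_ss(rows, labels):
    """Sum of squared Euclidean distances from each row to its cluster mean."""
    # Group the rows by their cluster label.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # The centroid is the per-column mean of the cluster's rows.
        centroid = [sum(m[j] for m in members) / len(members) for j in range(dim)]
        for m in members:
            total += sum((m[j] - centroid[j]) ** 2 for j in range(dim))
    return total


# Hypothetical toy data: 4 movies x 3 binary genre flags.
movies = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]

good = within_cluster_ss(movies, [0, 0, 1, 1])  # similar movies grouped together
bad = within_cluster_ss(movies, [0, 1, 0, 1])   # dissimilar movies grouped together
print(good, bad)
```

A lower value only says the grouping is tighter in the squared-Euclidean sense; it says nothing about whether the groups match your notion of a genre.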

For your use case, the best sanity check is that each of the existing genre assignments should be mostly within one of the clusters. If it's all over the place, the clustering does not cluster by your notion of genres.
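One possible way to run this sanity check is sketched below: for each genre, compute the share of its movies that fall into the single cluster where that genre is most common. This is only an illustrative Python sketch; the name `genre_concentration`, the toy data, and the idea of summarizing "mostly within one cluster" as a single fraction are assumptions.

```python
from collections import Counter


def genre_concentration(rows, labels):
    """For each genre column, the share of its movies found in its most common cluster."""
    result = {}
    for g in range(len(rows[0])):
        # Count, per cluster, the movies that carry genre g.
        hits = Counter(label for row, label in zip(rows, labels) if row[g] == 1)
        total = sum(hits.values())
        if total:  # skip genres that no movie carries
            result[g] = max(hits.values()) / total
    return result


# Hypothetical toy data: 4 movies x 3 binary genre flags, with cluster labels.
movies = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [0, 0, 1, 1]

concentration = genre_concentration(movies, labels)
print(concentration)
```

Values close to 1 mean a genre sits mostly inside one cluster; values close to 1/k (for k clusters) mean it is spread all over the place.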

Other tips

If you have no ground truth then there is no particular way to measure how successful your clustering was.

So, assuming you don't have a ground truth, you could use intra-cluster similarity: you measure how similar the points inside each cluster are to one another. I'd also take a look at mean shift clustering, as you don't need to specify the number of clusters.
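As a rough sketch of intra-cluster similarity, the snippet below averages the Pearson correlation over all pairs of movies that share a cluster, since that is the similarity the question asks about. It is plain Python for illustration only; the function names and toy data are assumptions, and pairs where one movie's genre vector is constant (all 0s or all 1s) are skipped because the correlation is undefined there.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length vectors, or None if undefined."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None  # a constant vector has no defined correlation
    return cov / sqrt(var_a * var_b)


def mean_intra_cluster_similarity(rows, labels):
    """Average Pearson correlation over all same-cluster pairs of rows."""
    sims = []
    for (row_a, label_a), (row_b, label_b) in combinations(zip(rows, labels), 2):
        if label_a == label_b:
            r = pearson(row_a, row_b)
            if r is not None:
                sims.append(r)
    return sum(sims) / len(sims) if sims else None


# Hypothetical toy data: 4 movies x 3 binary genre flags.
movies = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]

good = mean_intra_cluster_similarity(movies, [0, 0, 1, 1])
bad = mean_intra_cluster_similarity(movies, [0, 1, 0, 1])
print(good, bad)
```

Comparing this score across different values of k or different algorithms gives a relative ranking, not an absolute measure of correctness.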

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow