Question

I have a data set which is similar to the following:

It is recipe data along with the composition of the recipe (in %)

[screenshot of the recipe table omitted: recipes as rows, ingredient percentages as columns]

I have 91 recipes and 40 ingredients in total. I want to be able to cluster these recipes together into families based on similarity of ingredient composition.

How would I achieve this? Which clustering method can be used and how?


Solution

Welcome to the community.

There are many criteria on the basis of which you can cluster the recipes. The usual approach is to represent each recipe as a vector: with 40 ingredients, each of your 91 recipes becomes a 40-dimensional vector, one dimension per ingredient. The machine then sees your recipes as points in a 40-dimensional space.
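For instance, the table can be loaded into a 91 x 40 matrix with pandas. This is a minimal sketch; the file name and index column are assumptions about your data:

```python
import pandas as pd

# Hypothetical layout: one row per recipe, one column per ingredient,
# values are composition percentages.
df = pd.read_csv("recipes.csv", index_col="recipe")

X = df.to_numpy()   # shape (91, 40): one 40-dimensional vector per recipe
print(X.shape)
```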

Now, to measure the "similarity" between recipes, there are two common metrics. One is the Euclidean distance: https://en.wikipedia.org/wiki/Euclidean_distance

The other is cosine similarity: https://en.wikipedia.org/wiki/Cosine_similarity
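For intuition, both metrics are one-liners with SciPy, assuming the matrix X from the sketch above:

```python
from scipy.spatial.distance import cosine, euclidean

# Compare the first two recipe vectors.
d_euc = euclidean(X[0], X[1])   # straight-line distance in 40-D space
d_cos = cosine(X[0], X[1])      # cosine *distance* = 1 - cosine similarity

print(f"Euclidean: {d_euc:.3f}, cosine distance: {d_cos:.3f}")
```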

Coming back to how to cluster the data: you can use KMeans, an unsupervised algorithm. The only input it needs is how many clusters you want. Scikit-Learn in Python has a very good implementation of KMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
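A minimal sketch (the choice of 5 clusters is arbitrary here):

```python
from sklearn.cluster import KMeans

# n_clusters=5 is an arbitrary guess; see the elbow method in the
# tips below for a principled way to choose it.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # one cluster label per recipe
print(labels)
```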

However, there are two caveats: 1) as said before, it needs the number of clusters as an input, and 2) it is a Euclidean-distance-based algorithm, NOT a cosine-similarity-based one.

A better alternative to this is hierarchical clustering. It creates the clusters recursively, in a top-down approach (divisive) or a bottom-up approach (agglomerative). Read about it here: https://en.wikipedia.org/wiki/Hierarchical_clustering. It is better than KMeans in two ways:

1) You have flexibility in how to cut the tree to obtain the clusters: either by the number of clusters you want (as in KMeans), or by a distance threshold between cluster representatives.

2) You can also choose among various similarity criteria (the affinity), such as Euclidean distance, cosine similarity, etc. (see the sketch after this list).
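Here is a sketch using scikit-learn's AgglomerativeClustering. The cosine metric, average linkage, and 0.3 threshold are illustrative assumptions, not prescriptions; note that in scikit-learn versions before 1.2 the metric parameter was named affinity:

```python
from sklearn.cluster import AgglomerativeClustering

# Cut the tree by distance rather than by a fixed cluster count:
# with n_clusters=None plus distance_threshold, the data decides how
# many families emerge. The 0.3 threshold is an arbitrary starting point.
agg = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",       # called `affinity` in scikit-learn < 1.2
    linkage="average",     # "ward" linkage only supports Euclidean distance
    distance_threshold=0.3,
)
labels = agg.fit_predict(X)
print(f"{labels.max() + 1} clusters found")
```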

Hope this helps. Thanks.

OTHER TIPS

K-means clustering should be a good solution.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

However, in k-means clustering one must define "k", the number of clusters. You should find the optimal "k".

One way to do this is to use the Elbow Method. More info about the elbow method can be found here:

https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb

Short explanation:

1) You will calculate the squared distance of each datapoint to its nearest centroid.

2) You will sum these squared distances.

Try different values of 'k'. The sum always decreases as 'k' grows, so choose as your final value the 'k' at which the decrease starts to level off (the "elbow" of the curve), as sketched below.
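A sketch of that loop; in scikit-learn, a fitted model's inertia_ attribute is exactly this sum of squared distances:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# inertia_ = sum of squared distances of each datapoint to its nearest centroid
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

# The "elbow" is the k where the curve stops dropping steeply.
plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("sum of squared distances (inertia)")
plt.show()
```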

K-nn (k-nearest neighbors) is another algorithm that can be useful here, with one caveat: it is a supervised method, so on its own it cannot discover clusters.

Documentation: https://scikit-learn.org/stable/modules/neighbors.html

K-nn implements learning based on the nearest neighbors (k neighbors) of each datapoint: it assigns each datapoint to a class based on its k nearest neighbors. Since it needs labeled data, it fits best as a second step: once clustering has produced family labels, k-nn can assign new recipes to those families. It is a very fast algorithm.

A good thing about it is that you don't need to define the number of clusters yourself; note that you still choose a 'k' here, but it is the number of neighbors to consult, not the number of clusters.
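A sketch of that workflow, assuming X and the cluster labels from one of the clustering runs above (the random "new recipe" is just a stand-in for an unseen datapoint):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Train on the already-clustered recipes, then place a new recipe
# into one of the existing families.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, labels)

# Stand-in for an unseen recipe: a random composition summing to 100%.
new_recipe = np.random.default_rng(0).dirichlet(np.ones(40)) * 100
family = knn.predict(new_recipe.reshape(1, -1))[0]
print(f"New recipe assigned to family {family}")
```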

You can compare the outcomes of both approaches and choose the one that gives you the best results.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange