Unsupervised classification - feature vectors are obtained

https://stackoverflow.com/questions/10035517

29-05-2021

Question

I need to classify commercial products, the way price comparison engines do.

We have already obtained the feature vectors. They are not perfect, but they are pretty good. My last step is to group them without knowing how many clusters there are. Algorithms like k-means won't work because they need the number of classes in advance.

Here is an example set of feature vectors. They are listed in order here (as an example), but I need an algorithm that does not depend on any ordering.

#################################################
47 - ddr2;asus;1066;g41;am;p5qpl;775;
48 - g41;p5qpl;asus;am;ddr2;vga;anakart;
49 - intel;anakart;ddr2;1066;p5qpl;asus;am;
50 - p5qpl;ddr2;asus;am;g41;vga;anakart;
51 - ddr2;asus;1066;g41;am;p5qpl;775;
52 - g41;p5qpl;1066;am;ddr2;asus;anakart;
53 - p5qpl;ddr2;1066;am;g41;asus;sata;
54 - g41;p5qpl;1066;am;asus;ddr2;sata;
###################################################
55 - engtx480;asus;384bit;2di;gddr5;vga;16x;
56 - 2di;karti;384bit;asus;engtx480;ekran;pci;
57 - asus;engtx480;2di;vga;gddr5;384bit;16x;
58 - 2di;karti;engtx480;384bit;asus;gddr5;1536mb;
59 - engtx480;asus;384bit;2di;gddr5;vga;16x;
60 - engtx480;asus;384bit;2di;gddr5;vga;16x;
####################################################
61 - ray;blu;ihbs112;siyah;bulk;dvd;sata;
62 - ihbs112;ray;blu;on;lite;yazici;kutusuz;
63 - ihbs112;blu;ray;lite;on;siyah;bulk;
64 - blu;ihbs112;ray;lite;on;siyah;yazici;
65 - liteon;ihbs112;bd;yazma;hizi;12x;max;
66 - ihbs112;ray;blu;on;lite;bulk;dvd;
67 - etau108;dvd;siyah;lite;on;rw;ihbs112;
68 - ihbs112;liteon;bd;yazma;hizi;12x;max;
69 - ihbs112;ray;blu;lite;on;siyah;bulk;
#####################################################

A human can easily classify the products just by looking at these feature vectors, but I need to do it with an algorithm, and one that requires no prior information and uses only the feature vectors.

In the set above, 47-54 form one cluster, 55-60 another, and 61-69 a third (each cluster corresponds to one commercial product in real life). The algorithm needs to find these groups correctly using only feature vectors of this kind.

The algorithm cannot depend on the line order of the feature vectors or on knowing how many classes there will be. We know nothing else; we only have the feature vectors.

I look forward to your suggestions for this classification problem. Thank you.

Solution

Adaptive Resonance Theory (ART) is the short answer to your question. Unlike k-means, you don't need to set the number of clusters in advance. The input is a set of feature vectors, either binary (the ART 1 algorithm) or continuous (ART 2-A, ARTMAP, etc.), and the output is an assignment of the documents to clusters.
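To make the idea concrete, here is a minimal sketch of ART 1 style fast learning over token sets, in plain Python. It is a simplification and not a reference implementation: the function name `art1_cluster`, the `vigilance` value of 0.4 and the `beta` value of 1.0 are assumptions chosen for illustration, and only a few rows from the question are used. Each category keeps a prototype (a set of tokens). An input joins the best-matching category if the shared tokens cover at least `vigilance` of the input; otherwise a new category is created.

```python
def parse(row):
    """Turn 'a;b;c;' into the set {'a', 'b', 'c'} (token order is ignored)."""
    return {tok for tok in row.split(";") if tok}


def art1_cluster(samples, vigilance=0.4, beta=1.0):
    """Simplified ART 1 fast learning on binary features stored as sets.

    vigilance and beta are illustrative assumptions, not tuned values.
    Returns (labels, prototypes).
    """
    prototypes = []  # one token set per category
    labels = []
    for tokens in samples:
        if not tokens:
            raise ValueError("empty feature vector")
        # Rank existing categories by the choice function |I & w| / (beta + |w|).
        order = sorted(
            range(len(prototypes)),
            key=lambda j: len(tokens & prototypes[j]) / (beta + len(prototypes[j])),
            reverse=True,
        )
        for j in order:
            overlap = tokens & prototypes[j]
            # Vigilance test: does the prototype explain enough of the input?
            if len(overlap) / len(tokens) >= vigilance:
                prototypes[j] = overlap  # fast learning: keep shared tokens only
                labels.append(j)
                break
        else:
            # No category passed the vigilance test: open a new one.
            prototypes.append(set(tokens))
            labels.append(len(prototypes) - 1)
    return labels, prototypes


rows = [
    "ddr2;asus;1066;g41;am;p5qpl;775;",           # 47
    "g41;p5qpl;asus;am;ddr2;vga;anakart;",        # 48
    "intel;anakart;ddr2;1066;p5qpl;asus;am;",     # 49
    "engtx480;asus;384bit;2di;gddr5;vga;16x;",    # 55
    "2di;karti;384bit;asus;engtx480;ekran;pci;",  # 56
    "asus;engtx480;2di;vga;gddr5;384bit;16x;",    # 57
    "ray;blu;ihbs112;siyah;bulk;dvd;sata;",       # 61
    "ihbs112;ray;blu;on;lite;yazici;kutusuz;",    # 62
    "ihbs112;blu;ray;lite;on;siyah;bulk;",        # 63
]
labels, prototypes = art1_cluster([parse(r) for r in rows])
print(labels)
```

Two caveats. The result of ART depends to some degree on the order in which vectors are presented, and the vigilance threshold still has to be tuned. On the full list in the question, rows 65 and 68 share only the token "ihbs112" with the other rows of their product, so a threshold-based method like this one is likely to split them off into a separate category.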

Other tips

I can identify three main issues that need to be addressed.

1) From the examples you provided, it seems that all your vectors have 7 dimensions. If that is NOT always the case, you can use PCA to reduce the (unknown but bounded) number of dimensions to a fixed size. This lets you use a clustering algorithm without heavy modification.

2) To get around not knowing the number of clusters, you can use DBSCAN. It requires two parameters: the minimum number of points needed to form a cluster and the neighborhood radius.

3) You need a representation space, with the dimensionality from step 1), on which the clustering algorithm can operate. For this you have to think of a way to construct feature vectors from these samples. From the examples you showed, the tokens in the vectors do not look like arbitrary symbols, so despite what you say you might be able to use heuristics. If that's not possible, just choose a numeric representation for the feature values.
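Points 2) and 3) can be sketched together as follows. This is a plain-Python illustration under stated assumptions, not a library API: each row is treated as a set of tokens (equivalent to a binary bag-of-words vector), distance is Jaccard distance, and the helper names `jaccard_distance` and `dbscan`, as well as `eps=0.65` and `min_samples=2`, are hypothetical choices made for this example.

```python
def parse(row):
    """Turn 'a;b;c;' into the set {'a', 'b', 'c'} (token order is ignored)."""
    return {tok for tok in row.split(";") if tok}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0.0 for identical sets, 1.0 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(points, eps, min_samples):
    """Basic DBSCAN. Returns one label per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(points))
                if jaccard_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:  # j is a core point
                queue.extend(j_neighbors)
    return labels


rows = [
    "ddr2;asus;1066;g41;am;p5qpl;775;",           # 47
    "g41;p5qpl;asus;am;ddr2;vga;anakart;",        # 48
    "intel;anakart;ddr2;1066;p5qpl;asus;am;",     # 49
    "engtx480;asus;384bit;2di;gddr5;vga;16x;",    # 55
    "2di;karti;384bit;asus;engtx480;ekran;pci;",  # 56
    "asus;engtx480;2di;vga;gddr5;384bit;16x;",    # 57
    "ray;blu;ihbs112;siyah;bulk;dvd;sata;",       # 61
    "ihbs112;ray;blu;on;lite;yazici;kutusuz;",    # 62
    "ihbs112;blu;ray;lite;on;siyah;bulk;",        # 63
]
labels = dbscan([parse(r) for r in rows], eps=0.65, min_samples=2)
print(labels)
```

Because the rows are compared as sets, neither the token order within a row nor the number of tokens per row matters, so no fixed dimensionality is needed for this particular distance. The neighborhood radius still has to be chosen with care: in the full list from the question, rows 65 and 68 share only one token with the other rows of their product, so a radius small enough to keep the three products apart will probably leave those two rows in a cluster of their own.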

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow