Question

I have high-dimensional vectors (200 dimensions) that I want to cluster using Weka. How should I represent them in ARFF format?

The data is something like this (with dim1, dim2 etc. being real numbers):

vector_label dim1 dim2 dim3 ...... dim200

The ARFF documentation at http://weka.wikispaces.com/ARFF+%28stable+version%29 suggests that I should represent it as follows:

@RELATION vectors
@ATTRIBUTE vector_label STRING
@ATTRIBUTE dim1 NUMERIC
@ATTRIBUTE dim2 NUMERIC
@ATTRIBUTE dim3 NUMERIC
....
@ATTRIBUTE dim200 NUMERIC

@DATA
vector1,0.1,0.2,-2.1, ...... ,-0.1

and so on.

Is this correct? The reason I'm asking is that the link doesn't really say anything clearly about high dimensional vectors, but I feel there may be a better way of representation for them that I don't know about.


Solution

That representation is correct. The ARFF representation is the same whether you have many dimensions or few: one @ATTRIBUTE line per dimension and one comma-separated row per vector.
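Since the header is just one @ATTRIBUTE line per dimension, it can be generated rather than typed by hand. The following is a minimal sketch in Python, not part of Weka; the function name `dense_arff` and its parameters are hypothetical, and it assumes labels contain no spaces, commas, or quotes (such labels would need quoting in ARFF).

```python
def dense_arff(relation, rows, n_dims):
    """Build a dense ARFF document as a string.

    rows is an iterable of (label, values) pairs, where values holds
    exactly n_dims numbers. Labels are assumed to need no quoting.
    """
    # Header: relation name, the string label, then dim1..dimN.
    lines = [f"@RELATION {relation}", "@ATTRIBUTE vector_label STRING"]
    lines += [f"@ATTRIBUTE dim{i} NUMERIC" for i in range(1, n_dims + 1)]
    lines += ["", "@DATA"]

    # Data: one comma-separated row per vector.
    for label, values in rows:
        if len(values) != n_dims:
            raise ValueError(f"{label}: expected {n_dims} values, got {len(values)}")
        lines.append(",".join([label] + [repr(float(v)) for v in values]))

    return "\n".join(lines) + "\n"


# Small example with 3 dimensions instead of 200.
example = dense_arff("vectors", [("vector1", [0.1, 0.2, -2.1])], 3)
```

For the real data, `n_dims` would be 200 and the returned string would be written to a `.arff` file.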

However, if the vectors are sparse (most dimension values are zero in most of the vectors), you may want to use the sparse ARFF representation, which is much more compact and saves disk space and memory.
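In the sparse format the header stays the same, and each data row lists only the non-zero values as index/value pairs inside braces, with attribute indices starting at 0. The sketch below shows how one dense row could be converted; the function name `sparse_arff_row` is hypothetical, and it assumes the label is attribute 0 and needs no quoting. Weka's handling of string attributes in sparse files has some caveats, so it is worth checking the ARFF documentation before relying on the label column.

```python
def sparse_arff_row(label, values):
    """Format one vector as a sparse ARFF data row.

    Attribute 0 is vector_label, so dimension k (1-based) has
    attribute index k. Zero values are omitted.
    """
    entries = [f"0 {label}"]
    entries += [
        f"{index} {float(value)!r}"
        for index, value in enumerate(values, start=1)
        if value != 0
    ]
    return "{" + ", ".join(entries) + "}"


# Dimensions 2 and 5 are the only non-zero values here.
row = sparse_arff_row("vector1", [0, 0.2, 0, 0, -2.1])
# row == "{0 vector1, 2 0.2, 5 -2.1}"
```

Whether this pays off depends on how many zeros the vectors contain: for mostly non-zero data the index/value pairs make the file larger than the dense form.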

OTHER TIPS

Your example is correct if your data is not sparse. If your data is sparse, use the sparse ARFF file format. An example can be found here

Licensed under: CC-BY-SA with attribution