WEKA - Vector Attribute in arff format

https://stackoverflow.com/questions/11137479

16-06-2021

Question

I am new to Weka and I am trying to build a classifier to classify EEG data. The EEG attribute data is 5 minutes of recorded raw signals as well as other attributes. How can I specify in WEKA arff file format that my instance has a vector input of a 5 minute raw signal?

for example:

Num. -- raw -- class
1    -- [1,2,3,4,5,6] -- Relaxed
2    -- [2,3,4,5,6] --- Bored

where raw is an attribute vector.

Solution

Think about your problem: what are you trying to classify or predict, and how can it best be represented? Chances are that you don't want to predict the next raw EEG reading, so a time-series approach probably isn't critical.

Weka can only handle instances (rows of data) with a fixed set of attributes (features, values, or in other words, a vector of a predefined length). The possible attribute types are nominal (e.g. "red", "green", "blue"), numeric (any integer or floating-point value), string (mostly for text mining), and date. None of these can hold a vector of raw signal as a single attribute. Here is the documentation: http://weka.wikispaces.com/ARFF+%28stable+version%29
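To make the four attribute types concrete, here is a minimal Python sketch that builds the matching ARFF declaration lines. The helper attribute_line and the attribute names (colour, amplitude, notes, recorded) are made up for illustration; only the @attribute syntax itself comes from the ARFF format.

```python
def attribute_line(name, kind, nominal_values=None, date_format=None):
    """Return one ARFF @attribute declaration (hypothetical helper)."""
    if kind == "nominal":
        # Nominal attributes list their allowed values in braces.
        return "@attribute " + name + " {" + ",".join(nominal_values) + "}"
    if kind == "date":
        # Date attributes carry a quoted format pattern.
        return "@attribute " + name + ' date "' + date_format + '"'
    if kind in ("numeric", "string"):
        return "@attribute " + name + " " + kind
    raise ValueError("unsupported attribute kind: " + kind)


header = [
    attribute_line("colour", "nominal", nominal_values=["red", "green", "blue"]),
    attribute_line("amplitude", "numeric"),
    attribute_line("notes", "string"),
    attribute_line("recorded", "date", date_format="yyyy-MM-dd HH:mm:ss"),
]
print("\n".join(header))
```

Each declaration describes exactly one value per instance, which is why a whole signal cannot be squeezed into a single one of them.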

That said, your instances could look like this:

num,class1,reading_1,reading_2,reading_3 ... reading_n,relaxed,bored

where reading_1 is the first raw reading and reading_n is the last one at the end of 5 minutes. This would be asking WEKA to predict your class based on the raw readings, and probably won't be very effective, both because the readings may not line up with each other and because it treats each reading separately, with no regard for properties such as frequency or average that describe the signal as a whole.
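As a rough sketch of the one-attribute-per-reading layout, the Python below writes an ARFF file body from a list of recordings. The function names, the relation name eeg_raw, and the choice to fill short recordings with ARFF's missing-value marker "?" are my assumptions, not something WEKA requires; the sample data is taken from the question.

```python
def to_fixed_length(readings, n, pad_value="?"):
    """Truncate or pad a list of readings to exactly n values.

    Short recordings are padded with "?", the ARFF missing-value marker.
    """
    clipped = list(readings[:n])
    return clipped + [pad_value] * (n - len(clipped))


def build_raw_arff(recordings, n, classes=("relaxed", "bored")):
    """Return ARFF text with one numeric attribute per raw reading.

    recordings is a list of (readings, class_label) pairs.
    """
    lines = ["@relation eeg_raw", ""]
    for i in range(1, n + 1):
        lines.append("@attribute reading_" + str(i) + " numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for readings, label in recordings:
        values = to_fixed_length(readings, n)
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


recordings = [([1, 2, 3, 4, 5, 6], "relaxed"), ([2, 3, 4, 5, 6], "bored")]
arff_text = build_raw_arff(recordings, n=6)
print(arff_text)
```

Note that the second recording is one reading short, so its last attribute ends up missing. That is exactly the "readings may not line up" problem in miniature.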

Alternatively, you can do some pre-processing of the raw data so that it is useful for most machine learning algorithms in WEKA. In this case, you would need to decide on important features and then create them. A crude example could be:

num,class1,average,frequency,max_magnitude,standard_deviation,relaxed,bored

Here you calculate things like the average and frequency of the data before putting it into an ARFF file. The algorithms then have a much more informative picture of the dataset on which to base their predictions.
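A minimal sketch of that pre-processing step, using only the Python standard library. The frequency estimate here is a crude zero-crossing count, which is my stand-in rather than a recommended EEG method (a spectral estimate would usually be better), and sample_rate_hz is an assumed parameter you would take from your recording setup.

```python
import statistics

FEATURE_NAMES = ["average", "frequency", "max_magnitude", "standard_deviation"]


def extract_features(readings, sample_rate_hz):
    """Reduce one raw recording to a handful of summary features."""
    mean = statistics.fmean(readings)
    centered = [r - mean for r in readings]
    # Count sign changes of the mean-centered signal. A full cycle crosses
    # zero twice, so crossings / 2 / duration is a rough frequency in Hz.
    crossings = sum(
        1 for a, b in zip(centered, centered[1:]) if (a < 0) != (b < 0)
    )
    duration_s = len(readings) / sample_rate_hz
    return {
        "average": mean,
        "frequency": crossings / 2 / duration_s,
        "max_magnitude": max(abs(r) for r in readings),
        "standard_deviation": statistics.pstdev(readings),
    }


def build_feature_arff(recordings, sample_rate_hz, classes=("relaxed", "bored")):
    """Return ARFF text with one row of summary features per recording."""
    lines = ["@relation eeg_features", ""]
    lines += ["@attribute " + name + " numeric" for name in FEATURE_NAMES]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for readings, label in recordings:
        feats = extract_features(readings, sample_rate_hz)
        row = ",".join(format(feats[name], ".4f") for name in FEATURE_NAMES)
        lines.append(row + "," + label)
    return "\n".join(lines) + "\n"


demo = [([1, -1, 1, -1, 1, -1, 1, -1], "relaxed")]
print(build_feature_arff(demo, sample_rate_hz=8))
```

Every recording now produces the same number of attributes regardless of its length, which is what WEKA needs.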

Another concern is what each instance actually represents. Is the entire 5 minute sample the same class, or is the user relaxed for part of it and bored for part of it? If the latter, you should probably have two samples: one for when the user is bored and one for when she is relaxed.
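If you happen to know the user's state for each reading, splitting a recording into one sample per state could look roughly like this. The per-reading states list is a hypothetical annotation that I am assuming you have; the question does not say how the labels were recorded.

```python
from itertools import groupby


def split_by_state(readings, states):
    """Split one recording into contiguous segments that share a state.

    readings and states are parallel lists: states[i] is the label that
    applies to readings[i]. Returns a list of (segment, state) pairs.
    """
    if len(readings) != len(states):
        raise ValueError("readings and states must have the same length")
    segments = []
    index = 0
    for state, group in groupby(states):
        length = len(list(group))
        segments.append((readings[index:index + length], state))
        index += length
    return segments


readings = [1, 2, 3, 4, 5, 6]
states = ["relaxed", "relaxed", "relaxed", "bored", "bored", "bored"]
segments = split_by_state(readings, states)
print(segments)
```

Each resulting segment can then be turned into its own instance with a single class label.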

Licensed under: CC-BY-SA with attribution
