Converting separate text files containing training data and its labels to ARFF format

https://stackoverflow.com/questions/22767665

24-06-2023
Question

I need to perform a classification task in Weka on a dataset. The dataset consists of three text files: training.txt, label_training.txt and testing.txt. The format of training.txt and testing.txt is as follows:

InformationID  FeatureID  Value
1                6         1.00
1               160       31.00
1               438        1.00
1               479        1.00
2              6457        2.00
2              6664        0.65
2              6761        0.46
2              6762        1.00

The label_training.txt contains the class labels for the training data and each row represents a data point in the training set.

Does this mean that row 1 of label_training.txt corresponds to all rows in training.txt that have InformationID 1? I would like to make sure I understand it correctly. So one data point in the training set corresponds to InformationID 1, with values for the 4 features with IDs 6, 160, 438 and 479?

Now, how do I create an ARFF file which combines the training data and the labels for it to derive a classifier? Any help would be appreciated.

Solution

Well, it seems that your dataset is in a sparse format in which InformationID identifies the instance, FeatureID identifies the feature, and Value is the value of that feature for that instance. Any feature not listed for an instance is implicitly 0.

Let us assume that label_training.txt is positional, that is, an instance is identified by its line number: line #1 holds the label of instance #1, which corresponds to InformationID 1.
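Under that assumption, the pairing can be checked with a few lines of Python. This is only a sketch: the function name `group_by_instance` and the label values are made up for illustration, and it assumes that InformationIDs run from 1 to N with no gaps in the label file.

```python
from collections import defaultdict

def group_by_instance(lines):
    """Group 'InformationID FeatureID Value' rows into {instance_id: {feature_id: value}}."""
    instances = defaultdict(dict)
    for line in lines:
        parts = line.split()
        # Skip blank lines and the header row, if present.
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        info_id, feature_id, value = int(parts[0]), int(parts[1]), float(parts[2])
        instances[info_id][feature_id] = value
    return dict(instances)

sample = """InformationID  FeatureID  Value
1 6 1.00
1 160 31.00
1 438 1.00
1 479 1.00
2 6457 2.00
2 6664 0.65
2 6761 0.46
2 6762 1.00""".splitlines()

# Hypothetical contents of label_training.txt, one label per line.
labels = ["1", "-1"]

instances = group_by_instance(sample)
# Assumption: line i of the label file (1-based) belongs to InformationID i.
labelled = {
    info_id: (labels[info_id - 1], features)
    for info_id, features in sorted(instances.items())
}
print(labelled[1])
```

If the number of labels is smaller than the largest InformationID, the positional assumption does not hold and the files need a closer look.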

In this case, you need to generate ARFF files like the following one:

@relation my-relation

@attribute my-class {-1,1}
@attribute 1 numeric
@attribute 2 numeric
../..

@data
{0 1, 6 1.00, 160 31.00, 438 1.00, 479 1.00}
{0 1, 6457 2.00, 6664 0.65, 6761 0.46, 6762 1.00}
../..

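This is the WEKA sparse ARFF format, in which each pair of numbers is an attribute index (starting at 0) followed by its value. Here index 0 is the class attribute, so FeatureID k maps directly to attribute index k.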

I suggest writing a script to perform this transformation.
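A minimal sketch of such a script in Python follows. It is one possible implementation, not the only one. The function name `to_sparse_arff`, the `f<N>` attribute names and the sample labels are illustrative choices. The attributes are named `f1`, `f2`, ... rather than bare numbers because the ARFF documentation describes attribute names as starting with a letter. The sketch assumes that labels are positional and that FeatureIDs start at 1.

```python
def to_sparse_arff(data_lines, label_lines, relation="my-relation"):
    """Build a sparse ARFF string from triplet rows and positional labels."""
    # Collect {instance_id: {feature_id: value_text}} from the triplet rows.
    rows = {}
    for line in data_lines:
        parts = line.split()
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # header or blank line
        rows.setdefault(int(parts[0]), {})[int(parts[1])] = parts[2]

    labels = [l.strip() for l in label_lines if l.strip()]
    max_feature = max(f for feats in rows.values() for f in feats)

    out = [f"@relation {relation}", ""]
    # Attribute 0 is the class; attribute k is FeatureID k.
    out.append("@attribute my-class {" + ",".join(sorted(set(labels))) + "}")
    out.extend(f"@attribute f{k} numeric" for k in range(1, max_feature + 1))
    out += ["", "@data"]

    # Assumption: line i of the label file (1-based) belongs to InformationID i.
    for info_id, label in enumerate(labels, start=1):
        feats = rows.get(info_id, {})  # no rows means all features are 0
        cells = [f"0 {label}"] + [f"{k} {feats[k]}" for k in sorted(feats)]
        out.append("{" + ", ".join(cells) + "}")
    return "\n".join(out) + "\n"

sample = """InformationID  FeatureID  Value
1 6 1.00
1 160 31.00
1 438 1.00
1 479 1.00
2 6457 2.00
2 6664 0.65
2 6761 0.46
2 6762 1.00""".splitlines()

arff = to_sparse_arff(sample, ["1", "-1"])  # hypothetical labels
print(arff.splitlines()[-1])
```

For the real files, read training.txt and label_training.txt with `open(...)`, pass the resulting lines to the function, and write the returned string to a `.arff` file. The indices in each row are emitted in ascending order, which keeps the sparse rows well-formed.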

Licensed under: CC-BY-SA with attribution