Question

I have a large dataset with attributes in tabular form, as shown below:

userid  movieid  rating
2       34       5
4       11       3

I need to put these values into the data section of an ARFF file so that I can analyse them with Weka, the machine learning software. But the usual ARFF data format looks like this:

   5.1,3.5,1.4,0.2,Iris-setosa
   4.9,3.0,1.4,0.2,Iris-setosa
   4.7,3.2,1.3,0.2,Iris-setosa
   4.6,3.1,1.5,0.2,Iris-setosa

The attributes are comma-separated. Does ARFF always require commas, or is it OK to separate values with spaces or tabs?


Solution

Attribute values for each instance in the data section are always delimited by commas. From the ARFF documentation (developer version):

Each instance is represented on a single line, with carriage returns denoting the end of the instance. A percent sign (%) introduces a comment, which continues to the end of the line.

Attribute values for each instance are delimited by commas. A comma may be followed by zero or more spaces. Attribute values must appear in the order in which they were declared in the header section (i.e., the data corresponding to the nth @attribute declaration is always the nth field of the attribute).

A missing value is represented by a single question mark (?).
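
To make the quoted rules concrete, here is a minimal Python sketch of how one line of the data section could be read under them. The function name parse_data_line is made up for illustration and is not part of Weka. It also simplifies two things: only whole-line comments are recognised, and quoted values that contain commas are not handled.

```python
def parse_data_line(line):
    """Parse one line of an ARFF data section into a list of values.

    Returns None for blank lines and for lines that start with '%'.
    Simplification: quoted values containing commas are not handled.
    """
    line = line.strip()
    if not line or line.startswith("%"):
        return None
    # Values are comma-delimited; spaces after a comma are allowed.
    fields = [field.strip() for field in line.split(",")]
    # A single question mark marks a missing value.
    return [None if field == "?" else field for field in fields]


print(parse_data_line("5.1, 3.5,1.4,?,Iris-setosa"))
# ['5.1', '3.5', '1.4', None, 'Iris-setosa']
```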

In a similar situation I found weka-convert (a Python command-line utility) very useful.
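
For a table as simple as the one in the question, the conversion can also be done by hand. The following is only a sketch, assuming the first non-blank line holds the column names, the columns are separated by spaces or tabs, and every column is numeric. The function name table_to_arff and the relation name "ratings" are placeholders of mine, not anything Weka requires.

```python
def table_to_arff(lines, relation="ratings"):
    """Convert whitespace-separated rows (header first) to ARFF text.

    Assumes every column is numeric; `relation` is a placeholder name.
    """
    # str.split() with no argument splits on any run of spaces or tabs.
    rows = [line.split() for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    out = [f"@relation {relation}", ""]
    out += [f"@attribute {name} numeric" for name in header]
    out += ["", "@data"]
    # Re-join each row's values with commas for the data section.
    out += [",".join(row) for row in data]
    return "\n".join(out) + "\n"


sample = ["userid movieid rating", "", "2         34    5", "4         11    3"]
print(table_to_arff(sample))
```

For a large dataset, the same function can be fed an open file object instead of a list, since it only iterates over the lines.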

Licensed under: CC-BY-SA with attribution