Question

I want to read a dataset of Amazon reviews from the UCI repository using R.

The dataset comes in the format ARFF (.arff).

I am using the following script:

require("foreign")
setwd("H:/DataSet/amazon")
reviews <- read.arff("amazon.arff")

And I am getting the following error:

Error in read.arff("amazon.arff") : Invalid attribute specification.

Thank you in advance for your help.


Solution

I assume you mean the "Amazon Commerce reviews set Data Set" at the UCI Machine Learning Repository. Even Weka cannot open this dataset, saying

"...not recognized as an 'Arff data files' file. ... Attribute names are not unique."

and if you look into the file you see lots of entries similar to

@attribute '\'\'\'\'\'\'\'\'\'\'r\'\'\'\'\'\'\'\'\'\'\'' numeric

So something is wrong with the file itself; it is not the fault of R or of any ARFF-reading routine. You should ask the dataset creator, whose name and e-mail address are provided on the description page.
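To see which attribute names are repeated without opening Weka, the header can be scanned in base R. This is only a minimal sketch: the helper names arff_attribute_names and duplicated_attribute_names are made up for illustration, and it assumes one @attribute declaration per line, with names either unquoted or wrapped in single quotes with backslash escapes.

```r
# Hypothetical helper: extract the attribute name from each "@attribute" line.
# Assumes one declaration per line; names are either unquoted (no spaces) or
# single-quoted with backslash-escaped characters.
arff_attribute_names <- function(lines) {
  decl <- grep("^\\s*@attribute\\s", lines, ignore.case = TRUE,
               perl = TRUE, value = TRUE)
  sub("^\\s*@attribute\\s+('(?:\\\\.|[^'\\\\])*'|\\S+).*$", "\\1", decl,
      ignore.case = TRUE, perl = TRUE)
}

# Hypothetical helper: names that are declared more than once.
duplicated_attribute_names <- function(lines) {
  nm <- arff_attribute_names(lines)
  unique(nm[duplicated(nm)])
}

# Small made-up header so the sketch runs without the real file.
demo_lines <- c(
  "@relation demo",
  "@attribute g_b numeric",
  "@attribute T numeric",
  "@attribute eing numeric",
  "@attribute T numeric",
  "@attribute class {a,b}",
  "@data"
)
print(duplicated_attribute_names(demo_lines))

# On the real file (name taken from the question) one could try:
# lines <- readLines("amazon.arff", skipNul = TRUE, warn = FALSE)
# duplicated_attribute_names(lines)
```

skipNul = TRUE tells readLines to drop embedded NUL bytes; whether that matches how Weka reads this particular file is an assumption.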

OTHER TIPS

I found a way to get Weka to open the .arff file. The error Weka reports is:

Unable to determine structure as arff (Reason: java.lang.illegalArgumentException: Attribute names are not unique! Causes: 'T' 'T' 'T' 'T' 'I' 'I' 'I' 'I' 'Th' 'Th' 'Th' 'class').

The problem is not attributes like "@attribute '\'\'\'\'\'\'\'\'\'\'r\'\'\'\'\'\'\'\'\'\'\'' numeric". It is the duplicate names listed in the error message.

If you open the .arff file in a text editor (I used TextMate), you will find the culprits (TextMate displays them as <NUL>):

  @attribute g_b numeric
  @attribute T numeric
  @attribute eing numeric
  @attribute T numeric
  @attribute rne numeric
  @attribute T numeric
  @attribute T numeric

You could use Ctrl+F to search through the attributes for 'I', 'T' and 'Th', but to speed up the search, here are three easy-to-find attributes that sit close to the problem sites:

for 'I', search for 't_wo'

for 'Th', search for 'ff_'

for 'T', search for 'x_' (the duplicate attributes are just above this one)

You can't simply remove them, because there is no way to know which columns of numbers they correspond to, so I suggest renaming them to T2-4, I2-4 and Th2-4. You also need to rename the attribute 'class' to 'class1'.
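The renaming can also be scripted rather than done by hand. The following is a rough sketch, not a tested fix for this exact file: rename_duplicate_attributes is a hypothetical helper, it only handles unquoted attribute names, and it uses a uniform numbering scheme (T, T2, T3, ...), so a repeated 'class' would become 'class2' rather than 'class1'.

```r
# Hypothetical helper: give every repeated attribute name a numeric suffix,
# keeping the first occurrence unchanged (T, T2, T3, ...).
# Assumes one declaration per line and unquoted names for the duplicates.
rename_duplicate_attributes <- function(lines) {
  pattern <- "^(\\s*@attribute\\s+)(\\S+)(.*)$"
  is_decl <- grepl(pattern, lines, perl = TRUE, ignore.case = TRUE)
  if (!any(is_decl)) return(lines)

  decl   <- lines[is_decl]
  prefix <- sub(pattern, "\\1", decl, perl = TRUE, ignore.case = TRUE)
  nm     <- sub(pattern, "\\2", decl, perl = TRUE, ignore.case = TRUE)
  suffix <- sub(pattern, "\\3", decl, perl = TRUE, ignore.case = TRUE)

  # ave() numbers the occurrences of each name: 1, 2, 3, ...
  occurrence <- ave(seq_along(nm), nm, FUN = seq_along)
  dup <- occurrence > 1
  if (any(startsWith(nm[dup], "'"))) {
    stop("quoted duplicate names are not handled by this sketch")
  }
  new_nm <- ifelse(dup, paste0(nm, occurrence), nm)
  if (anyDuplicated(new_nm)) {
    stop("a generated name collides with an existing attribute")
  }

  lines[is_decl] <- paste0(prefix, new_nm, suffix)
  lines
}

# Made-up header so the sketch runs without the real file.
demo_lines <- c(
  "@relation demo",
  "@attribute T numeric",
  "@attribute x_ numeric",
  "@attribute T numeric",
  "@attribute T numeric",
  "@data"
)
fixed <- rename_duplicate_attributes(demo_lines)
print(fixed)

# For the real file, the output name below is only an example:
# lines <- readLines("amazon.arff", skipNul = TRUE, warn = FALSE)
# writeLines(rename_duplicate_attributes(lines), "amazon_fixed.arff")
```

Only the @attribute lines are touched; the @data section is passed through unchanged, so the columns stay aligned with their (renamed) attributes.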

In your particular case the dataset itself has some problems, and I was not able to read it.

I am not sure it helps here, but another way to read .arff files in R is the RWeka package.

The package has some dependencies: rJava (Note 1) and RWekajars.

Then, by using the following script you will be able to read the dataset (Note 2):

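library(rJava)   # Java bridge that RWeka depends on
library(RWeka)
# read.arff() here is the RWeka version, not the one from the foreign package
x <- read.arff(file = "amazon.arff")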

I haven't tried it with your dataset specifically (because of its problems), but with the iris.arff dataset the script works fine (of course, you need to change the file name).


Notes

  1. If you happen to have any errors with the rJava package, this answer that I gave on another question may help you.
  2. Make sure that you run the script from the folder that contains the file. One way to do this is to create a new project in RStudio, keep the dataset in the project's directory and then run your scripts from there.
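
For Note 2, a quick check before calling read.arff can make a wrong working directory obvious. A small sketch in base R; locate_data_file is a hypothetical helper, and the demo uses a temporary directory and a made-up demo.arff so it runs anywhere.

```r
# Hypothetical helper: fail early with a readable message when the file is
# not where R is looking, instead of failing inside the ARFF reader.
locate_data_file <- function(name, dir = getwd()) {
  path <- file.path(dir, name)
  if (!file.exists(path)) {
    stop("'", name, "' not found in ", dir,
         "; files there: ", paste(list.files(dir), collapse = ", "))
  }
  normalizePath(path)
}

# Demonstration in a temporary directory with a tiny made-up ARFF file.
demo_dir <- tempfile("arff_demo_")
dir.create(demo_dir)
writeLines(c("@relation demo", "@attribute x numeric", "@data", "1"),
           file.path(demo_dir, "demo.arff"))
found <- locate_data_file("demo.arff", demo_dir)
print(basename(found))

# With RWeka installed, the path could then be passed on:
# x <- RWeka::read.arff(found)
```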
Licensed under: CC-BY-SA with attribution