Question

I am trying to generate an .arff file from a csv data file I have. Now I am totally new to Weka and have started using it just a day back. I am trying out a simple twitter sentiment analysis with this for starters. I have generated training data in CSV. Contents of CSV file are as follows:

  tweet,affinScore,polarity
 ATAUTHORcfoblog is giving away a $25 Amex gift card (enter to win over $600 in prizes!) http://t.co/JD8EP14c ,4,4
"American Express has always been my dark horse acquirer of  ATAUTHORFoursquare. Bundle in Square-like payments & its a lite-retailer platform, no? ",0,1
African-American Demos Express Ethnic Identity Differently http://t.co/gInv4bKj via  ATAUTHORmediapost ,0,3
Google ???????? Visa ? American Express  http://t.co/eEZTSiHY ,0,4
Secrets to Success from Small-Business Owners : Lifestyle :: American Express OPEN Forum http://t.co/b85F8JX0 via  ATAUTHOROpenForum ,2,1
RT  ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of  ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1
Winning Surveys $1500 american express Huggies Sweeps http://t.co/WoaTFowp ,4,1
I root for Square mostly because a small business that takes Square is also one that takes American Express. ,0,1
I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Uh oh... RT  ATAUTHORBlackArrowBella: I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Just got another credit card. A Blue Sky card with American Express. Its gonna help pay for the honeymoon!  ATAUTHORAmericanExpress ,-1,1
Follow  ATAUTHORShaveMagazine and ReTweet this msg to be entered to #Win an American Express Gift card. Winners contacted bi-weekly by direct msg! ,2,4
American Express Gold zakelijk aanvragen: http://t.co/xheZwmbt ,0,3
RT  ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of  ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1

Here first attribute is actual tweet, second is AFFIN score and third is actual classification class (1- Positive, 2-Negative, 3-Neutral, 4-Spam)

Now I try to generate .arff format from it using code:

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

import java.io.File;

public class CSV2Arff {
  /**
   * takes 2 arguments:
   * - CSV input file
   * - ARFF output file
   */
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
      System.exit(1);
    }

    // load CSV
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File(args[0]));
    Instances data = loader.getDataSet();

    // save ARFF
    ArffSaver saver = new ArffSaver();
    saver.setInstances(data);
    saver.setFile(new File(args[1]));
    saver.setDestination(new File(args[1]));
    saver.writeBatch();
  }
}

This generates .arff file that looks somewhat like:

   @relation file

@attribute tweet {_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._

@data
_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,4,4
'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',0,1
African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,0,3
Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,0,4
Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,2,1
RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._,0,1

I am new to Weka but from what I have read, I have a suspicion that this ARFF is not correctly formed. Can anyone comment on it?

Also if it is wrong, can someone point me to where exactly am I going wrong?

Was it helpful?

Solution

Make sure to set the type of the tweet attribute to be arbitrary strings, not a categorial attribute, which seems to be the default. This doesn't scale well, as it puts a copy of every tweet in the type definition otherwise.

Note that for actual analysis of the tweet contents, you will likely need to preprocess them much further. You will likely want a sparse vector representation of the text instead of a long string.

OTHER TIPS

If you are using the UI as previously mentioned then you can just load the file directly into Weka.

If you just want to generate an ARFF file based on the CSV file you can do the following. This was taken from the CSV2Arff tool which comes as part of Weka.

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;

public class CSV2Arff {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
  System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
  System.exit(1);
}

// load CSV
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();

// save ARFF
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(args[1]));
saver.setDestination(new File(args[1]));
saver.writeBatch();
}
} 
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top