Running StringToWordVector filter in WEKA

https://stackoverflow.com/questions/15245766

18-03-2022

Question

I am using WEKA's Java API to develop an application. I run the StringToWordVector filter (to convert the string attributes) on my input.arff file, which looks like this:

    @relation Instantzien_Bektorea

    @attribute 5_Ainf_Lema string
    @attribute 6_Arg_PosKat {IZE,ADJ,ADI,ADB,DET,IOR,LOT,PRT,ITJ,BST,ADL,ADT,SIG,SNB,LAB,POST}
    @attribute 7_Arg_Pos_AzpiKat {ARR,IZB,LIB,ZKI,GAL,SIN,ADK,ADP,FAK,ERKARR,ERKIND,NOLARR,NOLGAL,DZH,BAN,ORD,DZG,ORO,PERARR,PERIND,IZGMGB,IZGGAL,BIH,ELK,JOK,JNT,HUTSA}
    @attribute 8_Arg_Kasua {abl,abu,abz,ala,soz,dat,des,erg,gel,gen,ine,ins,mot,abs,par,pro,bnk,desk,aurk,bald,emen,erlt,espl,haut,helb,kaus,konpl,kont,denb,mod,mos,ondo,zhg,neg,gen_post_ine,gen_post,gen_post_abs,ala_des,soz_post_ala,zero_post_abl,-}
    @attribute 9_Argumentuaren_FSint {-,subj,obj}
    @attribute 10_Arg_Posizioa {Aurretik,Atzetik}
    @attribute 11_Dist_HKop numeric
    @attribute 12_Dist_ArgKop numeric
    @attribute 13_Framea string
    @attribute 15_Frame_Unekoa string
    @attribute Klasea {arg0,arg1,arg2,argM*LOC,argM*TMP,argM*MNR,argM*Cause,argM*ADV,argM*PRP,argM*-,argM*NEG,argM*DIS}

    @data
    eta_gero,LOT,ARR,denb,-,Aurretik,999,1,argM_PRED_arg1,ARGM_PRED_arg1,argM*TMP
    Ainf_Lema,ADI,SIN,mod,-,Aurretik,1,1,argM_arg0_arg1_PRED,argM_arg0_ARG1_PRED,arg1
    Ainf_Lema,IZE,ARR,abs,subj,Aurretik,999,2,arg0_argM_arg1_PRED,ARG0_argM_arg1_PRED,arg0
    ...

I get a set of instances that, when written to output.arff, look like this:

    @relation 'Train_Instantzien_Bektorea-weka.filters.unsupervised.attribute.StringToWordVector-R1,9,10-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

    @attribute 6_Arg_PosKat {IZE,ADJ,ADI,ADB,DET,IOR,LOT,PRT,ITJ,BST,ADL,ADT,SIG,SNB,LAB,POST}
    @attribute 7_Arg_Pos_AzpiKat {ARR,IZB,LIB,ZKI,GAL,SIN,ADK,ADP,FAK,ERKARR,ERKIND,NOLARR,NOLGAL,DZH,BAN,ORD,DZG,ORO,PERARR,PERIND,IZGMGB,IZGGAL,BIH,ELK,JOK,JNT,HUTSA}
    @attribute 8_Arg_Kasua {abl,abu,abz,ala,soz,dat,des,erg,gel,gen,ine,ins,mot,abs,par,pro,bnk,desk,aurk,bald,emen,erlt,espl,haut,helb,kaus,konpl,kont,denb,mod,mos,ondo,zhg,neg,gen_post_ine,gen_post,gen_post_abs,ala_des,soz_post_ala,zero_post_abl,-}
    @attribute 9_Argumentuaren_FSint {-,subj,obj}
    @attribute 10_Arg_Posizioa {Aurretik,Atzetik}
    @attribute 11_Dist_HKop numeric
    @attribute 12_Dist_ArgKop numeric
    @attribute Klasea {arg0,arg1,arg2,argM*LOC,argM*TMP,argM*MNR,argM*Cause,argM*ADV,argM*PRP,argM*-,argM*NEG,argM*DIS}
    @attribute ARG0_PRED_arg1 numeric
    @attribute ARG0_arg1_PRED numeric
    @attribute ARG0_arg1_PRED_arg1_argM numeric
    @attribute ARG0_arg1_PRED_argM numeric
    @attribute ARG0_arg1_PRED_argM_argM numeric
    @attribute ARG0_argM_PRED numeric

    ...

    @attribute argM_PRED_ARG1_argM_argM numeric
    @attribute argM_PRED numeric
    @attribute argM_PRED_arg1_ARGM numeric

    ...

    @attribute ARGM_argM_PRED_arg1_argM numeric
    @attribute arg0_ARGM_arg1_PRED numeric
    @attribute arg0_ARGM_arg1_PRED_argM numeric
    @attribute arg0_arg1_PRED_ARGM_argM numeric
    @attribute eta_gero numeric
    @attribute gaur numeric

    @data
    {0 LOT,2 denb,5 999,6 1,7 argM*TMP,90 1,162 1,197 1}
    {0 ADI,1 SIN,2 mod,5 1,6 1,7 arg1,19 1,42 1,93 1}
    {2 abs,3 subj,5 999,6 2,16 1,19 1,29 1}

As you can see, in output.arff some attribute values are missing from the instances (the first instance has no first attribute, no third attribute, etc.). Why is this?

The Java code that runs the filter looks like this:

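      // StringToWordVector filter: convert string attributes 1, 9 and 10.
      // input_inst is the Instances object loaded from input.arff.
      // The flag and its value must be separate array elements.
      String[] options = new String[] {"-R", "1,9,10"};
      StringToWordVector filter = new StringToWordVector();
      filter.setOptions(options);
      // setInputFormat expects the Instances object, not the file name
      filter.setInputFormat(input_inst);
      Instances output_inst = Filter.useFilter(input_inst, filter);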

Any ideas where the problem can be? Thank you very much.

Solution

First of all, your input file is in the normal (dense) ARFF format, whereas the output file is in sparse ARFF: its data rows start with { and end with }. (See information on the Attribute-Relation File Format)

    {0 LOT,2 denb,5 999,6 1,7 argM*TMP,90 1,162 1,197 1}

In this sparse format, attributes whose value is 0 are omitted. Every attribute that is present is written as its index (counting from 0) followed by its value. In the above example (your first instance):

Attribute 0 => LOT
Attribute 1 omitted => 0
Attribute 2 => denb
Attribute 3 omitted => 0
...

If you look at the definition of attribute 1, you'll see that it is nominal rather than numeric, so the omitted 0 is the index of its value: in this case ARR, the first label declared for that attribute.

So, there are no attributes missing, they are just omitted in the output because it is in the sparse format.
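To make the rule concrete, here is a minimal sketch in plain Java (no WEKA dependency) that expands a sparse row back into dense values. The class name `SparseRowDemo`, the method `expand`, and the `defaults` array are hypothetical names chosen for this illustration. The sketch assumes simple unquoted values without embedded commas, and it covers only the first eight attributes of the output header, so the word columns further right are skipped.

```java
import java.util.Arrays;
import java.util.List;

class SparseRowDemo {

    /**
     * Expands one sparse ARFF row into dense values.
     * defaults[i] is what an omitted attribute i stands for: "0" for a
     * numeric attribute, or the first declared label (index 0) for a
     * nominal attribute.
     */
    static List<String> expand(String sparseRow, String[] defaults) {
        // Every attribute starts at its implicit (omitted) value.
        String[] dense = defaults.clone();

        // Strip the surrounding braces.
        String body = sparseRow.trim();
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return Arrays.asList(dense);
        }

        // Each entry is "<index> <value>"; overwrite the listed attributes.
        for (String entry : body.split(",")) {
            String[] parts = entry.trim().split("\\s+", 2);
            int index = Integer.parseInt(parts[0]);
            if (index < dense.length) { // skip columns this sketch does not model
                dense[index] = parts[1];
            }
        }
        return Arrays.asList(dense);
    }

    public static void main(String[] args) {
        // Implicit values of attributes 0..7 in the filtered header:
        // first label of each nominal attribute, "0" for numeric ones.
        String[] defaults = {"IZE", "ARR", "abl", "-", "Aurretik", "0", "0", "arg0"};

        String first = "{0 LOT,2 denb,5 999,6 1,7 argM*TMP,90 1,162 1,197 1}";
        // prints [LOT, ARR, denb, -, Aurretik, 999, 1, argM*TMP]
        System.out.println(expand(first, defaults));
    }
}
```

The expanded values match the nominal and numeric columns of the first row in input.arff, which shows that nothing was lost in the conversion.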

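In case you're wondering why you have different attributes: that is the result of the StringToWordVector filter. It replaces each selected string attribute with one numeric attribute per word found in it, and those new attributes are appended after the remaining ones.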
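As a rough illustration of where the new numeric attributes come from, the sketch below builds a word vocabulary from string values and encodes each value as 0/1 presence flags. It is a simplification in plain Java, not WEKA's implementation: the class `WordVectorSketch` and its methods are hypothetical, the sorted vocabulary order is my own choice for determinism, and options visible in the relation name (such as the word limit) are ignored. The delimiter set is the one shown in the output relation name; note that `_` is not a delimiter, so a value like `argM_PRED_arg1` stays a single word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeSet;

class WordVectorSketch {

    // Delimiters as listed in the output relation name.
    static final String DELIMITERS = " \r\n\t.,;:'\"()?!";

    /** Collects the distinct words; each one becomes a numeric attribute. */
    static List<String> vocabulary(List<String> values) {
        TreeSet<String> words = new TreeSet<>();
        for (String value : values) {
            StringTokenizer tokens = new StringTokenizer(value, DELIMITERS);
            while (tokens.hasMoreTokens()) {
                words.add(tokens.nextToken());
            }
        }
        return new ArrayList<>(words);
    }

    /** Encodes one string value as 0/1 word-presence flags. */
    static int[] encode(String value, List<String> vocab) {
        int[] row = new int[vocab.size()];
        StringTokenizer tokens = new StringTokenizer(value, DELIMITERS);
        while (tokens.hasMoreTokens()) {
            int index = vocab.indexOf(tokens.nextToken());
            if (index >= 0) {
                row[index] = 1;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Values of the 13_Framea column from the three input rows shown.
        List<String> frames = Arrays.asList(
                "argM_PRED_arg1", "argM_arg0_arg1_PRED", "arg0_argM_arg1_PRED");
        List<String> vocab = vocabulary(frames);
        System.out.println(vocab);
        System.out.println(Arrays.toString(encode(frames.get(0), vocab)));
    }
}
```

Because each row contains only a few of the many words, most of these columns are 0, which is why a sparse representation suits the filtered data.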

Licensed under: CC-BY-SA with attribution