Question

I am a newbie with Weka 3.7.9. I have an ARFF file, which contains these attributes, class and data: http://pastebin.com/s8hivv0U

This file represents Android projects, so attributes 1-9 are different kinds of metrics:

  1. lloc - Logical Lines Of Code
  2. nid - Number Of IDs
  3. nle - Nesting Level
  4. nel - Number of Elements
  5. nip - Number of Input elements
  6. activites - number of activities from AndroidManifest
  7. inside-permissions - number of inside permissions from AndroidManifest
  8. outside-permissions - number of outside permissions from AndroidManifest
  9. all-permissions - number of permissions from AndroidManifest
  10. class {4, 4.6, 3.8, 2.6, 5, 3.2, 3.6, 4.2, 4.1}

The last attribute is the class, which contains the Google Play rating of each project.

So each line is an Android project. (Naturally, the original *.arff file contains more projects...)

I would like to analyze the data with learning algorithms. The predictors are attributes 1 to 9. I would like to determine which predictors most influence the Google Play ratings.

How can I do that? And what are the best methods to do it? I would like to ask you to explain it to me, if possible.

Thanks in advance, Peter

Solution

Class type

First, I would suggest that you change your class type to numeric if you would like your output to be continuous. Otherwise, I'd suggest keeping the class label type as nominal (as you have it now) but changing your ratings to {1, 2, 3, 4, 5}.

If you change to a numeric output (so you can give a prediction of 4.5 stars, for example), then you'll need to use a classifier that is capable of handling a numeric class.
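
For reference, here is a minimal sketch using the Weka Java API, assuming your data is saved as projects.arff (a placeholder name I made up), that loads the file and reports whether the class attribute is numeric or nominal; the ARFF declarations in the comments show the two options discussed above:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassTypeCheck {
        public static void main(String[] args) throws Exception {
            // "projects.arff" is a placeholder for your own file. In its header the
            // class would be declared either as
            //   @attribute class numeric            (continuous ratings)
            // or as
            //   @attribute class {1,2,3,4,5}        (nominal star ratings)
            Instances data = new DataSource("projects.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);  // the class is the last attribute

            System.out.println("Numeric class: " + data.classAttribute().isNumeric());
        }
    }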

Using Weka

I'd suggest taking a look at the documentation to learn more about using Weka, possibly by going through some tutorials. For example, after double-clicking your ARFF file, you should be doing most of your work in the Classify tab of the Explorer. Select a classifier, then click Start.

Classifiers

Regression

Regression, and in particular linear regression, is nice because it is easy to interpret: it simply assigns a weight to each of your attributes and combines them with multiplication and addition to produce an output.

I used your example file and tested it with LinearRegression; however, with so few samples it determined that the best model was simply to output 3.9667 as the rating, which gives a mean absolute error of 0.4722.

Not satisfied, I next tried SimpleLinearRegression, which gives the model -0.02 * activites + 4.13 and a mean absolute error of 0.472.
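
If you prefer to run this programmatically rather than in the Explorer, a sketch along these lines should reproduce the LinearRegression experiment; the file name and the 10-fold cross-validation settings are assumptions on my part:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunLinearRegression {
        public static void main(String[] args) throws Exception {
            // Placeholder path; assumes the class attribute is already numeric.
            Instances data = new DataSource("projects.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(data);
            System.out.println(lr);  // prints the weight assigned to each attribute

            // 10-fold cross-validation; with only 12 instances the estimate is very noisy
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
            System.out.println("MAE: " + eval.meanAbsoluteError());
        }
    }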

SMOreg gives the following model

weights:
 +       0.1147 * (normalized) lloc
 -       0.0404 * (normalized) nid
 -       0.1662 * (normalized) nle
 -       0.0647 * (normalized) nel
 +       0.3385 * (normalized) nip
 -       0.1352 * (normalized) activites
 -       0.019  * (normalized) inside-permissions
 -       0.0464 * (normalized) outside-permissions
 +       0.1602 * (normalized) all-permissions
 +       0.5921

and has a mean absolute error of 0.3859. But at this point, I think that with so few data points you are overfitting your data.
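
One way to gauge how much of that is overfitting is to cross-validate SMOreg instead of evaluating on the training data; here is a sketch using leave-one-out-style folds (the file name is again a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMOreg;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunSMOreg {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("projects.arff").getDataSet();  // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // Leave-one-out cross-validation: as many folds as instances, so every
            // prediction is made on a project the model has never seen.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new SMOreg(), data, data.numInstances(), new Random(1));
            System.out.println("Cross-validated MAE: " + eval.meanAbsoluteError());
        }
    }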

Nearest neighbor

Using k-nearest neighbors might be a viable approach if you have more data (in Weka the classic k-NN classifier is IBk; KStar is a related instance-based learner you could also try).
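
Here is a sketch of how that could look via the API; the choice of k = 3 and the file name are arbitrary assumptions, and you could swap weka.classifiers.lazy.KStar in place of IBk:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunKnn {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("projects.arff").getDataSet();  // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // With a numeric class, IBk predicts by averaging the neighbours' ratings.
            IBk knn = new IBk();
            knn.setKNN(3);  // k = 3 is an arbitrary choice for illustration

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.println("MAE: " + eval.meanAbsoluteError());
        }
    }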

Decision trees

The DecisionStump algorithm outputs the following model with a mean absolute error of 0.3424, but again it is probably overfitting:

inside-permissions <= 1.5 : 2.6
inside-permissions > 1.5 : 4.090909090909091
inside-permissions is missing : 3.966666666666667
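
For completeness, a sketch that builds and prints such a stump via the API (placeholder file name again):

    import weka.classifiers.trees.DecisionStump;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunDecisionStump {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("projects.arff").getDataSet();  // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // A stump is a one-level tree: it picks the single attribute and threshold
            // that best split the ratings.
            DecisionStump stump = new DecisionStump();
            stump.buildClassifier(data);
            System.out.println(stump);  // prints a model like the one shown above
        }
    }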

More data

As you can see, the models and error rates are not so great, considering you only have 12 data points. To build a really good model, you'll need more data. To get an accurate idea of how well the model is doing, you not only need enough data to train with, but also enough data to keep aside as a separate test set that you use only for evaluating the model's performance.
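
When you do have more data, you can hold out an explicit test set; here is a sketch with an arbitrary 66/34 split and a placeholder file name:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainTestSplit {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("projects.arff").getDataSet();  // placeholder path
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));

            // Roughly 2/3 for training, 1/3 held out for testing (an arbitrary split)
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(train);

            // Evaluate only on the held-out projects
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(lr, test);
            System.out.println("Test MAE: " + eval.meanAbsoluteError());
        }
    }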

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow