Question

I use RapidMiner and I have a data set with 40 lines of 14 columns each. Each line holds different metrics of an Android application, with the Google Play ranking at the end of the line (the first line is the header, which contains the names of the metrics).

(So the goal is to predict the Google Play ranking from the metrics.)

The data set: http://pastebin.com/Cw1BR4K6

  • columns 1-13: different kinds of metrics
  • column 14: Google Play ranking
  • lines 2-40: metrics of Android projects

I used PolynomialRegression in RapidMiner and I got this result:

- 6.723 * lloc ^ 1.000
+ 1.187 * nid ^ 2.000
- 47.730 * nle ^ 1.000
- 36.433 * nel ^ 1.000
- 1.466 * nip ^ 2.000
- 97.187 * activites ^ 1.000
- 50.080 * inside-permissions ^ 1.000
- 60.291 * outside-permissions ^ 1.000
- 52.472 * all-permissions ^ 4.000
- 2.309 * jtlloc ^ 1.000
+ 36.058 * jtnm ^ 1.000
+ 9.924 * jtna ^ 1.000
+ 40.504 * jtncl ^ 1.000
+ 9.455

My question: how can I check that this result is correct? How can I check it against a line that is already in the data set? For example, I would like to apply this result to line 25: 25,8,5,10,0,1,0,0,0,239,10,14,4,3.8

My other question: which methods can I use to make predictions on this data set, and which of them is the best? If possible, please explain it to me.

Thanks in advance, Peter


Solution

The result of the polynomial regression is a trained model. If you want to apply the model to a data set and see the results, use the Apply Model operator. It takes two inputs: the model and the data. The output of this operator is the data set with one more attribute: the regression result, i.e. the predicted value for each line.
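To see what applying the model to a single line amounts to, the printed polynomial can also be evaluated by hand. The following is only a minimal Python sketch, not RapidMiner code. It assumes that the columns of the data set appear in the same order as the terms of the printed model (the header line is not shown in the question, so this is a guess) and that RapidMiner applies no extra preprocessing. The names `TERMS`, `COLUMNS` and `predict` are made up for this illustration.

```python
# Terms copied from the model output in the question:
# (coefficient, attribute name, degree).
TERMS = [
    (-6.723, "lloc", 1),
    (1.187, "nid", 2),
    (-47.730, "nle", 1),
    (-36.433, "nel", 1),
    (-1.466, "nip", 2),
    (-97.187, "activites", 1),
    (-50.080, "inside-permissions", 1),
    (-60.291, "outside-permissions", 1),
    (-52.472, "all-permissions", 4),
    (-2.309, "jtlloc", 1),
    (36.058, "jtnm", 1),
    (9.924, "jtna", 1),
    (40.504, "jtncl", 1),
]
INTERCEPT = 9.455

# ASSUMPTION: the CSV columns are in the same order as the model terms.
# Check this against the real header line before trusting the result.
COLUMNS = [name for _, name, _ in TERMS]


def predict(row):
    """Evaluate the polynomial for one row (dict: attribute name -> value)."""
    return INTERCEPT + sum(coef * row[name] ** degree
                           for coef, name, degree in TERMS)


# Line 25 from the question: 13 metrics followed by the actual ranking.
line_25 = "25,8,5,10,0,1,0,0,0,239,10,14,4,3.8"
values = [float(v) for v in line_25.split(",")]
row = dict(zip(COLUMNS, values[:13]))
actual = values[13]

predicted = predict(row)
print(f"predicted={predicted:.3f} actual={actual}")
```

Under the column-order assumption the expression comes out at roughly -673, while the actual ranking on that line is 3.8. If the real header order matches, that would point to a poorly fitting model; if it does not match, the mapping in `COLUMNS` has to be corrected first.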

But evaluating the performance of a model on the same data it was trained on is a very bad idea: an overfitted model will look much better than it really is. To evaluate the model's performance correctly, split the data into a training set (used for training the model) and a testing set (used to evaluate performance). Or use cross-validation, which is essentially the same thing, but done multiple times and averaged (in RapidMiner: Edit -> New Building Block -> Numerical X-Validation).
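The idea behind cross-validation can be sketched in a few lines of plain Python. This is not what RapidMiner runs internally, only an illustration of the principle: the data is split into k folds, the model is trained on k-1 of them and tested on the remaining one, and the error is averaged over all folds. The toy data (one feature, an exact linear relation) and the helper names `k_fold_indices`, `fit_line` and `cross_validated_rmse` are assumptions made for the example; a simple one-variable least-squares line stands in for the real regression model.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, k=5):
    """Average root-mean-squared error over k train/test splits."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        # The error is measured only on lines the model has not seen.
        squared = [(ys[i] - (slope * xs[i] + intercept)) ** 2
                   for i in test_idx]
        fold_errors.append((sum(squared) / len(squared)) ** 0.5)
    return sum(fold_errors) / len(fold_errors)


# Toy data, NOT the data set from the question: 39 rows, y = 2x + 1.
xs = [float(i) for i in range(39)]
ys = [2.0 * x + 1.0 for x in xs]
print(cross_validated_rmse(xs, ys, k=5))
```

With only 39 data lines each fold holds about 8 lines, so the error estimate from a single run can vary noticeably depending on how the lines are shuffled.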

Which regression method to choose is a difficult question, and the answer depends on your specific needs. Is your only criterion the regression error? Do you need human-readable output? You will surely need to experiment with multiple methods. And I'm not sure you will get any conclusive results with such a small data set.

Licensed under: CC-BY-SA with attribution