Does high error rate in regression imply the data set is unpredictable?

https://datascience.stackexchange.com/questions/4901

16-10-2019

Question

I have a data set of video watching records in a 3G network. The data set includes two kinds of features:

User-side information, e.g., age, gender, data plan, etc.;
Video watching records of these users, each of which is associated with a download ratio and some detailed network condition metrics, such as download speed and RTT.

In internet streaming, a video is divided into several chunks that are downloaded to the end device one by one, so we have download ratio = downloaded bytes / file size in bytes.

Now, given this data set, I want to predict the download ratio of each video.

Since this is a regression problem, I use a gradient boosting regression tree as the model and run 10-fold cross-validation.

However, I have tried different model parameter configurations and even different models (linear regression, decision tree regression), and the best root-mean-square error (RMSE) I can get is 0.3790. That is quite high: if I skip any complex model and simply use the mean of the known labels as the prediction, I still get an RMSE of 0.3890. There is no obvious difference.

For this problem, I have some questions:

Does this high error rate imply that the label in this data set is unpredictable?
Apart from the feature problem, are there any other possibilities? If so, how can I validate them?

Solution

It's a little hasty to draw too many conclusions about your data based on what you presented here. At the end of the day, all the information you have right now is that "GBT did not work well for this prediction problem and this metric", summed up by a single RMSE comparison. This isn't very much information. It could be that this is a bad dataset for GBT and some other model would work, it could be that the label can't be predicted from these features with any model, or there could be some error in model setup/validation.
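To put that single RMSE comparison in perspective, here is a small arithmetic sketch using only the two numbers from the question. The second quantity rests on an assumption: that the mean-prediction RMSE is close to the standard deviation of the labels, in which case one minus the ratio of squared errors roughly approximates R^2.

```python
baseline_rmse = 0.3890  # RMSE when predicting the mean label (from the question)
model_rmse = 0.3790     # best cross-validated RMSE (from the question)

# Relative reduction in RMSE compared with the mean baseline.
relative_improvement = (baseline_rmse - model_rmse) / baseline_rmse

# Assumption: baseline RMSE ~ label standard deviation, so
# 1 - (model MSE / baseline MSE) is a rough stand-in for R^2.
approx_r2 = 1.0 - (model_rmse / baseline_rmse) ** 2

print(f"relative RMSE improvement: {relative_improvement:.1%}")
print(f"approximate R^2: {approx_r2:.3f}")
```

Under that assumption the model explains only about 5% of the variance in the label, which is consistent with "barely better than the mean" but does not by itself say why.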

I'd recommend checking the following hypotheses:

1) Maybe, with your dataset size and the features you have, GBT isn't a well-suited model. Try something completely different - maybe just a simple linear regression! Or a random forest. Or GBT with very different parameter settings. Or something else. This will help you diagnose whether it's an issue with choice of models or with something else; if a few very different approaches give you roughly similar results, you'll know that it's not the model choice that is causing these results, and if one of those models behaves differently, then that gives you additional information to help diagnose the issue.

2) Maybe there's some issue with model setup and validation? I would recommend doing some exploration to get some intuition as to whether the RMSE you're getting is reasonable or whether you should expect better. Your post contained very little detail about what the data actually represents, what you know about the features and labels, etc. Perhaps you know those things but didn't include them here, but if not, you should go back and try to get additional understanding of the data before continuing. Look at some random data points, plot the columns against the target, look at the histograms of your features and labels, that sort of thing. There's no substitute for looking at the data.

3) Maybe there just aren't enough data points to justify complex models. When you have low numbers of data points (< 100), a simpler parametric model built with domain expertise and knowledge of what the features are may very well outperform a nonparametric model.
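For hypothesis 1, the comparison is only fair if every model is scored on the same folds. Here is a minimal sketch in plain Python, assuming a synthetic data set with a single made-up feature. `fit_mean`, `fit_linear`, and `cross_val_rmse` are illustrative names, not part of any library, and the one-feature least-squares fit just stands in for whatever models you actually try.

```python
import random
import statistics

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training labels."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """One-feature ordinary least squares (illustrative stand-in model)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def cross_val_rmse(fit, xs, ys, k=10, seed=0):
    """Mean RMSE over k folds; a fixed seed gives every model the same folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([ys[i] for i in fold], [model(xs[i]) for i in fold]))
    return statistics.fmean(scores)

# Synthetic data for illustration only (NOT the asker's data):
# a label in [0, 1] that depends linearly on one feature plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(500)]
ys = [min(1.0, max(0.0, 0.2 + 0.6 * x + rng.gauss(0, 0.1))) for x in xs]

candidates = [("mean baseline", fit_mean), ("linear regression", fit_linear)]
results = {name: cross_val_rmse(fit, xs, ys) for name, fit in candidates}
for name, score in results.items():
    print(f"{name}: RMSE = {score:.4f}")
```

On real data, the interesting outcome is whether very different models all land near the baseline (suggesting the features carry little signal) or whether one of them pulls clearly ahead.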
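For hypothesis 2, a first look at the data can be as simple as a histogram of the label and the correlation of each feature with it. The sketch below uses synthetic rows, and the column names `download_speed`, `rtt`, and `download_ratio` are assumptions made for illustration. `label_histogram` and `pearson` are hypothetical helpers written here, not library functions.

```python
import random
import statistics
from collections import Counter

def label_histogram(values, bins=10):
    """Count values in equal-width bins; assumes every value lies in [0, 1]."""
    counts = Counter(min(int(v * bins), bins - 1) for v in values)
    return [counts.get(b, 0) for b in range(bins)]

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Synthetic rows for illustration only (NOT the asker's data).
rng = random.Random(7)
rows = []
for _ in range(1000):
    speed = rng.uniform(0.1, 2.0)  # hypothetical download speed
    rtt = rng.uniform(50, 500)     # hypothetical round-trip time
    ratio = min(1.0, max(0.0, 0.25 * speed + rng.gauss(0.3, 0.2)))
    rows.append({"download_speed": speed, "rtt": rtt, "download_ratio": ratio})

ratios = [row["download_ratio"] for row in rows]

# Text histogram of the label; the last bin also holds values equal to 1.0.
hist = label_histogram(ratios)
for b, count in enumerate(hist):
    print(f"{b / 10:.1f}-{(b + 1) / 10:.1f} {'#' * (count // 10)} {count}")

# Correlation of each feature with the label.
correlations = {
    name: pearson([row[name] for row in rows], ratios)
    for name in ("download_speed", "rtt")
}
for name, r in correlations.items():
    print(f"corr({name}, download_ratio) = {r:+.3f}")
```

If the real label turns out to pile up at a few values (for example near 0 and 1), RMSE against the mean can look poor even when a different framing of the problem would be more predictable. Note that Pearson correlation only captures linear relationships, so a value near zero does not rule out a nonlinear one.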

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange