Question

I have a large dataset (>300,000 observations) that represents the distance (RMSD) between pairs of proteins. I'm building a regression model (Random Forest) that is supposed to predict the distance between any two proteins.

My problem is that I'm more interested in close matches (short distances), but my data distribution is highly skewed: the majority of the distances are large. I don't really care how well the model predicts large distances; I want to make sure it predicts the distances of close matches as accurately as possible. However, when I train the model on the full data its performance isn't good. So I wonder what the best sampling strategy is: one that guarantees the model predicts the distances of close matches as accurately as possible, while at the same time not stratifying the data too much, since this skewed distribution unfortunately reflects the real-world distribution on which I'm going to validate and test the model.

The following is my data distribution, where the first column is the lower bound of a distance bin and the second column is the number of observations in that bin:

Distance  Observations
0          330
1          1903
2          12210
3          35486
4          54640
5          62193
6          60728
7          47874
8          33666
9          21640
10         12535
11         6592
12         3159
13         1157
14         349
15         86
16         12

Solution

The first thing I would try here is building a regression model of the log of the distance, since this will compress the range of the larger distances. If you're using a generalised linear model, this is the log link function; for other methods you can do it manually by fitting a regression function f to your inputs x and exponentiating the result:

y = exp( f(x) )

Remember to train on the log of the distance for each pair.
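
A minimal sketch of this idea with scikit-learn (the placeholder data, the names X and d, and the RandomForestRegressor settings are illustrative, not from the original post; log1p/expm1 are used instead of a plain log because the data include distances of 0):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Placeholder data for illustration; substitute your own features and RMSD values
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    d = np.abs(rng.normal(loc=5, scale=2, size=1000))

    X_train, X_test, d_train, d_test = train_test_split(X, d, random_state=0)

    # Fit on log-distances; log1p keeps the d == 0 observations well defined
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_train, np.log1p(d_train))

    # Invert the transform at prediction time: y = exp(f(x)) - 1
    d_pred = np.expm1(model.predict(X_test))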

OTHER TIPS

Popular techniques for dealing with imbalanced distributions in regression include:

  • Random over/under-sampling (a minimal sketch follows this list).
  • The Synthetic Minority Over-sampling Technique for Regression (SMOTER), which has an R package implementing it.
  • The Weighted Relevance-based Combination Strategy (WERCS), which has a GitHub repository of R code implementing it.
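
For the first of these, a minimal sketch of random under-sampling with pandas (the column name 'distance', the cutoff of 5, and the sampling fraction are hypothetical knobs to tune, not values from the original post):

    import numpy as np
    import pandas as pd

    # Placeholder data for illustration; substitute your own features and RMSD target
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'feature_1': rng.normal(size=1000),
        'distance': np.abs(rng.normal(loc=5, scale=2, size=1000)),
    })

    # Keep every close match, down-sample the abundant large distances
    close = df[df['distance'] < 5]
    far = df[df['distance'] >= 5]
    train_df = pd.concat([close, far.sample(frac=0.2, random_state=0)])

Note that evaluation should still be done on held-out data with the original, skewed distribution, since that is what the model will face in practice.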

PS: The binned table you show makes it look like you have a classification problem rather than a regression problem.

As previously mentioned, I think what might help you, given your problem, is the Synthetic Minority Over-sampling Technique for Regression (SMOTER).

If you're a Python user, I'm currently working to improve my implementation of the SMOGN algorithm, a variant of SMOTER. https://github.com/nickkunz/smogn
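
A usage sketch of that package, based on its basic interface (the placeholder DataFrame and the column name 'distance' are illustrative; check the repository's examples for the relevance parameters, e.g. rel_xtrm_type, which control whether the low or the high tail of the target counts as rare):

    import numpy as np
    import pandas as pd
    import smogn  # pip install smogn

    # Placeholder data for illustration; substitute your own features and RMSD target
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'feature_1': rng.normal(size=1000),
        'distance': np.abs(rng.normal(loc=5, scale=2, size=1000)),
    })

    # Over-sample rare target values by interpolating synthetic examples
    # (SMOTER) and perturbing them with Gaussian noise (the SMOGN variant)
    df_balanced = smogn.smoter(data=df, y='distance')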

Also, there are a few examples on Kaggle that have applied SMOGN to improve their prediction results. https://www.kaggle.com/aleksandradeis/regression-addressing-extreme-rare-cases

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow