Question

I am starting as a PhD student, and we want to find appropriate materials (with certain target qualities) from basic chemical properties such as charge. There are many models and datasets in similar works, but since our work is fairly novel, we have to make and test each data sample ourselves. This makes data acquisition very slow and very expensive. We estimate we will have only 10-15 samples for some time, until we can expand the dataset.

Now I want to use these samples to build a basic predictive model with as much generalization ability as possible. I will use this model to screen candidates from a large pool of property combinations to find the most promising materials, and will then proceed to make them for testing.

Now I clearly don't expect performance anywhere near 95% or so, but I want a working model with enough predictive capability to actually help me find some of the most promising material candidates, so we can expand our work. I am uncertain whether I can (or rather should) use standard ML methods like dataset splitting and cross-validation, so I would appreciate your thoughts.

Since our data size is minuscule, I have been searching for ways to improve the model's robustness. These are my ideas:

1- Use an ensemble model to avoid overfitting and skewed biases (using algorithms like elastic net, SVM, random forests, etc.).

2- Set heavy regularization to counter the biases that can arise from small data.

3- Use algorithms that converge to the vicinity of the minimum faster.

I would appreciate any suggestions on how I can improve this model as much as possible, to reach the best generalization performance.

I have also thought about synthetic data generation a lot. Do you have any suggestions on how I can go about it?

Solution

From what you say, I think you should start by checking three options:

I) Ordinary least squares (OLS): Just run a "normal" linear regression. This will not yield great predictions, but you could view the model as a causal one if you can assume a linear relation between $y$ and $x$. For example, with five predictors and 35 observations you have $35 - 5 - 1 = 29$ residual degrees of freedom, which is "okay". When you estimate the model in "levels", i.e. with the values as they are, you can directly interpret the estimated coefficients as marginal effects. E.g. a model $y=\beta_0+\beta_1 x + u$ tells you that when $x$ increases by one unit, $y$ changes by $\beta_1$ units, just like a linear function.
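A minimal sketch of such an OLS fit in Python with statsmodels; the random numbers below are placeholders standing in for your measured descriptors and target property:

```python
# Minimal OLS sketch with statsmodels; the data here are random placeholders
# standing in for your measured descriptors and target property.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))                  # e.g. 12 samples, 2 descriptors (charge, radius)
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=12)

X_const = sm.add_constant(X)                  # add the intercept column
result = sm.OLS(y, X_const).fit()
print(result.summary())                       # coefficients read directly as marginal effects
```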

II) You can use Lasso/Ridge/Elastic Net: all of them are linear models with a penalty term that "shrinks" the coefficients of $x$ variables that are "not useful". This works like automatic feature selection, if you like. There is a great R package by Hastie et al. for this (glmnet), and it is also available for Python.
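As a minimal sketch, here is an elastic net with a cross-validated penalty using scikit-learn's ElasticNetCV as a stand-in for glmnet; the data, the l1_ratio grid, and the fold count are illustrative placeholders, not tuned values:

```python
# Minimal elastic-net sketch with a cross-validated penalty; the random data
# are placeholders for your measured descriptors and target.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))                          # 12 samples, 5 chemical descriptors
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=12)

model = make_pipeline(
    StandardScaler(),                                 # penalties are scale-sensitive, so standardize
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=4, max_iter=10_000),
)
model.fit(X, y)
print(model.named_steps["elasticnetcv"].coef_)        # zeroed coefficients were "shrunk away"
```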

III) Maybe (!) boosting could be an option as well: you would likely need to do some feature selection/engineering on your own, but boosting can work with a small number of observations and highly correlated features, and it often handles highly non-linear problems well. LightGBM and CatBoost are possible Python packages; both come with minimal examples in their documentation.
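A minimal LightGBM sketch under the assumption of roughly a dozen rows; the tree parameters are deliberately loosened (and only illustrative), since the defaults such as min_child_samples=20 assume far more data than you have:

```python
# Minimal gradient-boosting sketch with LightGBM on a tiny dataset; the random
# data and all hyperparameter values are illustrative placeholders.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))                      # 12 samples, 5 descriptors
y = X[:, 0] ** 2 - X[:, 2] + rng.normal(scale=0.1, size=12)

model = lgb.LGBMRegressor(
    n_estimators=200,
    learning_rate=0.05,
    num_leaves=3,          # very shallow trees for very few observations
    min_child_samples=2,   # the default (20) is larger than the whole dataset
)
model.fit(X, y)
print(model.predict(X[:3]))
```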

With II) and III) you will find that you cannot really "set aside" a number of observations to check whether your models work, because you don't have much data. You could use cross-validation (Ch. 5 in ISL, see below), but be aware of how noisy its estimates become with so few observations. Instead of going straight for a predictive model, I tend to say you might be better off starting with a "causal-like" OLS model: with OLS you do not really need a test set, and it is very robust.
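If you do try cross-validation, leave-one-out is the natural splitter at this sample size, since each sample is held out exactly once. A minimal scikit-learn sketch; the ridge model, its penalty, and the random data are placeholders:

```python
# Minimal leave-one-out cross-validation (LOOCV) sketch; each of the 12 samples
# is held out once while the model is fit on the remaining 11.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=12)

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print("LOO mean absolute error:", -scores.mean())
```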

Since you seem to be new to statistical modeling, you might benefit from having a look at "Introduction to Statistical Learning" (Chapters 3 and 6 in particular). The PDF is available online and there is code for the labs in Python and R. The more advanced book would be "Elements of Statistical Learning".

Good luck with your project!

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange