Question

My input space is at least 10-dimensional (after reducing it with various component analyses such as PCA) and the output space is 4-dimensional. I am building a neural network that works as a function approximator: it takes the 10-D data as input (10 input neurons) and produces the 4-D data as output (4 output neurons), with hidden layers in between.

I need to build a good training set that covers all the possible values of the input and output. Although a 10-D input and 4-D output may seem to allow a very large number of combinations, in reality the input/output is restricted by real-world constraints. The problem I'm facing is this: in some parts of the domain I need to sample the data at high resolution, while in other parts I can get away with sampling at lower resolution. I could obviously sample the entire dataset at high resolution, but then the number of samples exceeds 10 trillion, and I know that most of the dataset varies slowly, so sampling those regions coarsely is enough. My rough estimate is that, if I sampled it properly, I could get away with 4-5 orders of magnitude fewer samples.

My question is: what should I do or build in order to adaptively sample a high-dimensional dataset (changing resolution wherever necessary) and obtain an optimal training set?

NB: one of my colleagues advised me to use a Markov Chain Monte Carlo method. I am not sure whether that is the best way to go; please share your opinions.


Solution

Take a look at active learning.

https://en.m.wikipedia.org/wiki/Active_learning_(machine_learning)

In this approach you train on a subset of the samples and then add more, depending on which samples would be most informative for the model. A common way to score informativeness in a regression setting is to measure how much an ensemble of models disagrees on each candidate point, as sketched below.
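Here is a minimal sketch of pool-based active learning for a problem shaped like yours (10-D input, 4-D output). The pool size, the `simulate` stand-in for your real data source, the ensemble model, and the batch sizes are all illustrative assumptions, not a prescription:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical pool of candidate inputs (in practice: your full candidate grid).
X_pool = rng.uniform(-1.0, 1.0, size=(100_000, 10))

def simulate(X):
    """Stand-in for the real process that produces the 4-D outputs (assumption)."""
    return np.column_stack([
        np.sin(X[:, :4].sum(axis=1)),
        X[:, 4] * X[:, 5],
        np.exp(-X[:, 6:8].sum(axis=1) ** 2),
        X[:, 8] - X[:, 9],
    ])

# Start from a small random seed set, then repeatedly add the points where the
# ensemble disagrees most (high predictive variance = most informative).
labeled_idx = list(rng.choice(len(X_pool), size=500, replace=False))
for round_ in range(10):
    X_train = X_pool[labeled_idx]
    y_train = simulate(X_train)

    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)

    # Disagreement across trees as an informativeness score for each pool point.
    preds = np.stack([tree.predict(X_pool) for tree in model.estimators_])  # (trees, N, 4)
    score = preds.std(axis=0).mean(axis=1)                                  # (N,)
    score[labeled_idx] = -np.inf                                            # don't re-pick

    # Query the 500 most uncertain points and add them to the training set.
    new_idx = np.argsort(score)[-500:]
    labeled_idx.extend(new_idx.tolist())
    print(f"round {round_}: {len(labeled_idx)} samples")
```

The effect is exactly the adaptive resolution you describe: regions where the function varies quickly stay uncertain longer and therefore get sampled densely, while slowly varying regions are left coarse. You could keep the random forest as the final model or use the selected points to train your neural network.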

Alternatively, you could reduce the number of samples by clustering the candidate inputs and choosing a small number of samples from each cluster, as in the sketch below.
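A rough sketch of that alternative, assuming a hypothetical candidate pool and an arbitrary cluster count (both would need tuning for your data): cluster the inputs with k-means and keep the pool point nearest each centroid as a representative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(0)
X_pool = rng.uniform(-1.0, 1.0, size=(100_000, 10))   # hypothetical candidate inputs

# Cluster the pool; MiniBatchKMeans scales better to large candidate sets.
kmeans = MiniBatchKMeans(n_clusters=2_000, n_init=3, random_state=0).fit(X_pool)

# One representative per cluster: the pool point closest to each centroid.
rep_idx = pairwise_distances_argmin(kmeans.cluster_centers_, X_pool)
X_train = X_pool[rep_idx]          # 2,000 representative samples out of 100,000
```

Unlike active learning, this selects points purely by input-space geometry, so it will not automatically concentrate samples where the outputs vary quickly; it is mainly useful for thinning out redundant, near-duplicate inputs.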

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange