Sampling Data into two Groups

https://stackoverflow.com/questions/17766502

03-06-2022

Question

I am seeking help to make the code below more efficient. It works, but I am not satisfied with it. There is a bug still to be fixed (currently irrelevant). I am using the <random> header and stable_partition for the first time.

The Problem definition/specification:
I have a population (a vector) of numerical data (float values). I want to create two RANDOM samples (two vectors) based on a user-specified percentage, i.e. popu_data = 30% Sample1 + 70% Sample2, where the 30% is given by the user. I haven't implemented the percentage yet, but that is trivial.

The Problem in Programming: I am able to create the 30% sample from the population. Creating the other vector (sample2, the 70%) is my problem. While selecting the 30%, I have to pick the values randomly, so I have to keep track of the chosen indexes in order to remove them. Somehow I cannot come up with a more efficient logic than the one I implemented.

My Logic (NOT happy with it): In the population data, the values at random indexes are replaced with a sentinel value (here 0.5555). Later I learnt about the stable_partition function, which I use to compare each value of the population against that sentinel. The values that are not the sentinel form the new sample2, which complements sample1.

Further to this: How can I make this generic, i.e. split a population into N sub-samples, each a user-defined percentage of the population?

Thank you for any help. I tried vector erase, remove, copy etc., but none of it worked out as well as the current code. I am looking for better, more efficient logic and STL usage.

#include <random>
#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

bool Is05555 (float i){
    if ( i > 0.5560 ) return true;
    return false;
}

int main()
{
    random_device rd;
    mt19937 gen(rd());
    uniform_real_distribution<> dis(1, 2);
    vector<float>randVals;

    cout<<"All the Random Values between 1 and 2"<<endl;
    for (int n = 0; n < 20; ++n) {
        float rnv = dis(gen);
        cout<<rnv<<endl;
        randVals.push_back(rnv);
    }
    cout << '\n';

    random_device rd2;
    mt19937 gen2(rd2());
    uniform_int_distribution<int> dist(0,19);

    vector<float>sample;
    vector<float>sample2;
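    for (int n = 0; n < 6; ++n) {
        // Note: dist may return the same index twice, in which case the
        // sentinel 0.5555 itself is copied into sample.
        int idx = dist(gen2);
        sample.push_back(randVals.at(idx));
        randVals.at(idx) = 0.5555;
    }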

    cout<<"Random Values between 1 and 2 with 0.5555 a Unique VAlue"<<endl;
    for (int n = 0; n < 20; ++n) {
        cout<<randVals.at(n)<<" ";
    }
    cout << '\n';

    std::vector<float>::iterator bound;
    bound = std::stable_partition (randVals.begin(), randVals.end(), Is05555);

    for (std::vector<float>::iterator it=randVals.begin(); it!=bound; ++it)
        sample2.push_back(*it);

    cout<<sample.size()<<","<<sample2.size()<<endl;

    cout<<"Random Values between 1 and 2 Subset of 6 only: "<<endl;

    for (int n = 0; n < sample.size(); ++n) {
        cout<<sample.at(n)<<" ";
    }
    cout << '\n';

    cout<<"Random Values between 1 and 2 - Remaining: "<<endl;
    for (int n = 0; n < sample2.size(); ++n) {
        cout<<sample2.at(n)<<" ";
    }
    cout << '\n';

    return 0;
}

Answer

Given a requirement for an N% sample, with order irrelevant, it's probably easiest to just do something like:

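// std::random_shuffle was deprecated in C++14 and removed in C++17;
// std::shuffle takes an explicit engine (here gen, the std::mt19937 from the question).
std::shuffle(randVals.begin(), randVals.end(), gen);
int num = randVals.size() * percent / 100.0;

auto pos = randVals.begin() + randVals.size() - num;

// get our sample (spelled out, since auto cannot deduce a vector from two iterators)
std::vector<float> sample1(pos, randVals.end());

// remove sample from original collection
randVals.erase(pos, randVals.end());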
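
For reference, the shuffle-and-split idea above might be wrapped into a self-contained function along these lines. This is only a sketch: the name split_sample, its signature, and the clamping of percent to the range 0 to 100 are assumptions made for illustration, not part of the original answer. It uses std::shuffle because std::random_shuffle is no longer available as of C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: returns {sample, remainder}, where sample holds
// roughly `percent` percent of the elements, chosen at random.
std::pair<std::vector<float>, std::vector<float>>
split_sample(std::vector<float> data, double percent, std::mt19937& gen)
{
    // Assumption: out-of-range percentages are clamped rather than rejected.
    percent = std::min(100.0, std::max(0.0, percent));

    // After shuffling, any contiguous slice is a random sample.
    std::shuffle(data.begin(), data.end(), gen);

    const auto num = static_cast<std::ptrdiff_t>(data.size() * percent / 100.0);
    const auto pos = data.end() - num;

    // Copy the tail into the sample, then drop it from the population.
    std::vector<float> sample(pos, data.end());
    data.erase(pos, data.end());

    return {std::move(sample), std::move(data)};
}
```

With the question's 20 values and percent set to 30, the first vector would hold 6 elements and the second the remaining 14, with no sentinel value needed.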

For some types of items in the array, you could improve this by moving items from the original array to the sample array, but for simple types like float or double, that won't accomplish anything.
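
The same shuffle-then-slice approach can probably be stretched to cover the question's follow-up about N sub-samples. The sketch below is one possible way to do it, not a definitive implementation: split_by_percent, its parameters, and the decision to return any unclaimed elements as one extra trailing group are all assumptions. It moves elements out of the shuffled population with std::make_move_iterator, which only pays off for element types that are expensive to copy (for float it behaves like a plain copy).

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Hypothetical helper: shuffles `data`, then cuts it into consecutive slices
// sized by `percents`. Returns percents.size() + 1 groups; the last group
// holds whatever no percentage claimed (possibly nothing).
template <typename T>
std::vector<std::vector<T>>
split_by_percent(std::vector<T> data, const std::vector<double>& percents,
                 std::mt19937& gen)
{
    std::shuffle(data.begin(), data.end(), gen);

    std::vector<std::vector<T>> groups;
    const std::size_t n = data.size();
    std::size_t first = 0;
    double cumulative = 0.0;

    for (double p : percents) {
        cumulative += std::max(0.0, p);  // assumption: negative entries count as 0
        // Cut at the cumulative boundary so rounding errors do not accumulate.
        std::size_t last =
            static_cast<std::size_t>(n * std::min(cumulative, 100.0) / 100.0);
        last = std::max(first, std::min(last, n));
        groups.emplace_back(std::make_move_iterator(data.begin() + first),
                            std::make_move_iterator(data.begin() + last));
        first = last;
    }

    // Leftover elements (rounding, or percentages summing to under 100).
    groups.emplace_back(std::make_move_iterator(data.begin() + first),
                        std::make_move_iterator(data.end()));
    return groups;
}
```

Calling it with a single percentage such as {30.0} gives the two-group case from the question; a list such as {20.0, 30.0, 50.0} gives three sized groups plus the trailing leftover group.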

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow