How to Present All Categories in All Samples

https://datascience.stackexchange.com/questions/86038

17-12-2020
Question

I have a dataset that contains many categorical columns. When I sampled this data randomly a few times and applied one-hot encoding to the categorical columns, I noticed that I ended up with datasets that have different column counts. This happens because not every category is preserved in every sample: different samples include different subsets of the categories for each column. Is there a way to ensure that every categorical column in every sample contains all possible categories?

Solution

The first thing we must accept is that the sampling is probably doing the right job.
What I mean is that if only 10% of the data is sampled, a category that occurs fewer than 5 times can easily be missed.
Ideally, you should club such rare values into a single generic value per column, e.g. OTHER_COL_1.
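
A minimal sketch of that clubbing step with pandas, assuming the categorical columns hold plain strings (object dtype). The function name `group_rare_categories`, the `min_count` threshold, and the `OTHER_` prefix are illustrative choices, not something prescribed above:

```python
import pandas as pd

def group_rare_categories(df, columns, min_count=5, other_prefix="OTHER_"):
    """Replace categories seen fewer than min_count times with one placeholder per column."""
    out = df.copy()
    for col in columns:
        counts = out[col].value_counts()
        rare = counts[counts < min_count].index
        # Keep frequent values as they are; swap rare ones for e.g. "OTHER_COL_1".
        out[col] = out[col].where(~out[col].isin(rare), f"{other_prefix}{col}")
    return out

# Toy data (made up for illustration): "c" and "d" each occur only once.
df = pd.DataFrame({"COL_1": ["a"] * 10 + ["b"] * 6 + ["c", "d"]})
grouped = group_rare_categories(df, ["COL_1"])
print(grouped["COL_1"].value_counts())
```

The placeholder can still be missed by a small sample if the rare values are rare even in total, so this reduces the problem rather than eliminating it.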

But if you want to work around this natural result, you need to apply some tweaking.

We may do the following:

1. Get the sample as you are doing now.
2. Compare the unique values of each column in the sample with the unique values of that column in the main data.
3. Iterate over each column and each missed unique value. For example, assume UNIQUE_4 is missed for COL_2:
   - Select all the records from the main data where COL_2 is UNIQUE_4.
   - Pick one of those records at random and add it to the sample.
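
The steps above can be sketched in pandas roughly as follows. This is only one possible reading of the procedure: the function name `sample_with_all_categories` and its parameters are hypothetical, and the sketch assumes the main data has a unique index so that a row picked twice can be dropped.

```python
import pandas as pd

def sample_with_all_categories(df, columns, frac=0.1, random_state=None):
    """Random sample of df, topped up so every category of each column appears at least once."""
    # Step 1: draw the sample as usual.
    sample = df.sample(frac=frac, random_state=random_state)
    extras = []
    for col in columns:
        # Step 2: find categories present in the main data but missing from the sample.
        missing = set(df[col].dropna().unique()) - set(sample[col].dropna().unique())
        # Step 3: for each missed value, pick one random record that carries it.
        for value in missing:
            candidates = df[df[col] == value]
            extras.append(candidates.sample(n=1, random_state=random_state))
    if extras:
        sample = pd.concat([sample] + extras)
        # Assumes a unique index in df: drop rows that were picked more than once.
        sample = sample[~sample.index.duplicated()]
    return sample

# Toy data (made up for illustration) with a few very rare categories.
df = pd.DataFrame({
    "COL_1": ["a"] * 50 + ["b"] * 49 + ["c"],
    "COL_2": ["x"] * 98 + ["y", "z"],
})
sample = sample_with_all_categories(df, ["COL_1", "COL_2"], frac=0.1, random_state=0)
encoded = pd.get_dummies(sample, columns=["COL_1", "COL_2"])
print(encoded.shape)
```

Note that the added rows make rare categories slightly over-represented compared with a purely random sample, which is the price of forcing every category to appear.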

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange