Domanda

I have a data contains many categorical columns. When I sampled this data randomly a few times and applied one-hot encoding to categorical columns I noticed that it ended up with datasets with different column counts. Because not all categories in columns preserved in samples and different samples includes different subset of categories for each column. Is there a way to ensure all categorical columns in all samples contains all possible categories?

È stato utile?

Soluzione

The first thing we must accept that the sampling is probably doing the right job.
What I mean is that if only 10% is being sampled then some unique value which is less than 5 can be easily missed.
Ideally, you should club these values into some generic value i.e. OTHER_COL_1

But, if you want to get away with this natural result, you should apply some tweaking.

We may do the following -

  • Get the sample as you are doing now
  • Match the unique element of each column to the unique from the main data
  • Iterate on each col and missed unique value
  • Let's assume UNIQUE_4 is missed for COL_2
  • Sample all the records for UNIQUE_4 from COL_2 of main data and
  • Pick one random data out of it
Autorizzato sotto: CC-BY-SA insieme a attribuzione
scroll top