Question

I'm trying to generate some synthetic data for experiments. For data sets with only numerical features this is rather easy: I just sample from a Gaussian mixture (using Netlab, a package for Matlab) and that's done.

Now I also need to generate some data sets with both numerical and categorical features. The numerical part I can easily do using the above method, but what about the categorical part?

I was thinking of generating a categorical feature with (say) 3 categories, with probabilities of 68.2% (within +/- 1 sigma), 27.2% (between +/- 1 sigma and +/- 2 sigma), and 4.6% (the rest), among the objects with the same label.

And perhaps another categorical feature with 5 categories, with probabilities of 34.1%, 34.1%, 13.6%, 13.6%, 4.6% - again, within the objects with the same label.

Does that make sense to you? Any thoughts?

I can easily write the code for the above, but if you know of any function that does it for me - please let me know.

Thanks!

No correct solution

OTHER TIPS

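It's easy to do in Python using NumPy. With n=1, each draw from a multinomial is a one-hot vector over the categories:

import numpy as np
# Each of the 10 rows is a one-hot draw over 3 categories with the given probabilities.
samples = np.random.multinomial(n=1, pvals=[.3, .3, .4], size=10)
labels = samples.argmax(axis=1)  # convert one-hot rows to category indices 0..2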
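
To get the "within the objects with the same label" behaviour from the question, one option is to keep a separate probability vector per label and sample each object from the vector for its label. The following is only a minimal sketch: the table `probs_by_label`, the helper `sample_categorical`, and the randomly drawn `labels` are all made up for illustration (in practice the labels would be the mixture components the numerical features were drawn from), and the second row of probabilities is just a permutation of the 68.2/27.2/4.6 split.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the example is reproducible

# Hypothetical table: one row per class label, one column per category.
# Row 0 is the 68.2/27.2/4.6 split from the question; row 1 permutes it.
probs_by_label = np.array([
    [0.682, 0.272, 0.046],
    [0.046, 0.682, 0.272],
])

def sample_categorical(labels, probs_by_label, rng):
    """Draw one category index per object from its label's distribution."""
    labels = np.asarray(labels)
    out = np.empty(labels.shape[0], dtype=int)
    for k, p in enumerate(probs_by_label):
        mask = labels == k
        # Sample only for the objects carrying label k.
        out[mask] = rng.choice(len(p), size=mask.sum(), p=p)
    return out

# Stand-in labels; replace with the component labels of the Gaussian mixture.
labels = rng.integers(0, 2, size=1000)
feature = sample_categorical(labels, probs_by_label, rng)
```

Here `feature` holds integer category codes (0, 1, 2) aligned with `labels`, so it can be placed next to the numerical columns. A 5-category feature works the same way with 5 columns per row.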
Licensed under: CC-BY-SA with attribution