Question

I need a review for my solution for sampling 100 random rows from a table stored on a MPP machine (currently Netezza , later might be hadoop/etc.)

I'm not interested in using Netezza's rand(), as I would like to be able to reproduce the same sample later, and I'm not counting on setseed().

The solution I'm using now is:

SELECT * FROM MY_TABLE ORDER BY ID % 371, ID % 17, ID % 501, ID LIMIT 100

where the 3 numbers are primes I'm generating by my own RNG. Am I on the right track? Is this "random" sample random enough?

NOTE: I don't need it to be a crypto strong random sample, I just want to make sure I'm picking a different sample each time, and sampling uniformly, and I would like to easily be able to reproduce my sample (by executing the same SQL) if needed.

Thanks!

Was it helpful?

Solution

I would not try to generate random numbers in SQL when it is so easy to get some real random numbers, stick them in a table or in your stored procedure.

Go to https://www.fourmilab.ch/hotbits/secure_generate.html to get 256 random bits in hex, group the digits 4 at a time, convert to decimal integers and then use the first 100 numbers as your keys.

You will have a repeatable sequence and you can make as many truly random sequences as you need.

If you need repeatable pseudorandom numbers with a uniform distribution, the same principle applies. Just use a different source for the numbers. For instance you could write a Python script that uses the included random module to generate your numbers.

As for the SQL to use, if you structure your DB properly it is straightforward. In this case, since I build the random table once, and use it for queries more than once, I would build the table to contain primary keys, not just the random numbers that are generated. Or if I was going to repeat this with different sets of numbers then I would build a relationship table (like in ER modeling) that links the random number table to the table I am sampling. I wouldn't force everything into SQL but generate the random number table and build the relationship table using a scripting language.

In the example you give above you need 100 samples, therefore the random sampling table will contain 100 primary keys selected by an external script. Every time you run the sampling, you get exactly the same records until you change the random sampling table. To change the table, dump the primary keys of the table that you want to sample, then run a script to randomly pick 100, by generating 100 numbers between 1 and the total number of primary keys. If you use a tool like Python you can get uniform random numbers, gamma distribution, gaussian, log normal, pareto and others.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top