I have a need to take a bounded random user sample that is uniformly distributed across the US geography.
I am wondering if the following approach is reasonable to achieve my goal in Oracle 11g, given that I have zip-code information about the users.
My approach is to use zip-codes to accomplish the geographic distribution across the US.
To ensure the selected zip-codes are uniformly distributed across the US:
- I am using the "zone" (first two digits).
To get a random distribution across the zip-code zone:
- For each zip-code I assign a random number.
- I then sort the groups of zip-code zones by the random number and
select the top N (10 in the following example).
Lastly, (not included below) I would retrieve a set of random set of users from each zip-code.
select
y.ZIP_CODE
from
(
select
x.ZIP_CODE
,x.ZIP_CODE_ZONE
,row_number()
over (
partition by x.ZIP_CODE_ZONE
order by x.random_num
)
AS row_number
from
(
select
ZIP_CODE
,ZIP_CODE_ZONE
,dbms_random.value(0,100000) as random_num
from
ZIP_CODE_TABLE
where 1=1
and COUNTRY_CODE = 'US'
) x
) y
where
y.row_number <= 10
order by
y.zip_code
,y.row_number
;
Update:
After more research, I found that this approach will not fulfill the goal of getting a "uniformly" distributed set of information across the US. The result is still biased because zip-code zones are "small" near "large" population centers. So the chance of retrieving a user in these small zones is larger than the larger zones, so users cluster together near population centers.