Is this the best way to get a bounded, random result, uniformly distributed over a column value?

https://stackoverflow.com/questions/17047914

31-05-2022
|

Question

I have a need to take a bounded random user sample that is uniformly distributed across the US geography.

I am wondering if the following approach is reasonable to achieve my goal in Oracle 11g, given that I have zip-code information about the users.

My approach is to use zip-codes to accomplish the geographic distribution across the US.

To ensure the selected zip-codes are uniformly distributed across the US: - I am using the "zone" (first two digits).

To get a random distribution across the zip-code zone:

For each zip-code I assign a random number.
I then sort the groups of zip-code zones by the random number and select the top N (10 in the following example).

Lastly, (not included below) I would retrieve a set of random set of users from each zip-code.

select
  y.ZIP_CODE
from
  (
  select
    x.ZIP_CODE
   ,x.ZIP_CODE_ZONE
   ,row_number()
    over (
      partition by x.ZIP_CODE_ZONE
      order by x.random_num
    )
     AS row_number
  from
    (
    select 
      ZIP_CODE
     ,ZIP_CODE_ZONE
     ,dbms_random.value(0,100000) as random_num
    from
      ZIP_CODE_TABLE
    where 1=1
      and COUNTRY_CODE = 'US'
    ) x
  ) y
where
  y.row_number <= 10
order by 
  y.zip_code
 ,y.row_number
;

Update:

After more research, I found that this approach will not fulfill the goal of getting a "uniformly" distributed set of information across the US. The result is still biased because zip-code zones are "small" near "large" population centers. So the chance of retrieving a user in these small zones is larger than the larger zones, so users cluster together near population centers.

Solution

Data from the database is retrieved in unsorted manner. You can get the same results if you have already some data in buffer cache.

select y.ZIP_CODE
from
(
    select 
          ZIP_CODE
        , ZIP_CODE_ZONE
        , row_number() over (partition by ZIP_CODE_ZONE order by dbms_random.value(0,100000)) AS row_number
    from  ZIP_CODE_TABLE
    where COUNTRY_CODE = 'US'
) y
where y.row_number <= 10
order by 
      y.zip_code
    , y.row_number
;

OTHER TIPS

How about

select ZIP_CODE from ZIP_CODE_TABLE SAMPLE (0.1);

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow