Random selection while breaking down by percentage over multiple groups

https://stackoverflow.com/questions/2060008

20-09-2019

Question

I'm trying to put together a simple system for a user to generate a list of users to whom surveys will be sent. The list generation may depend on various constraints. For example, "we only want people from the U.S. and Canada" or "we only want people who have a level 2 or level 3 membership."

This part is pretty easy and I've set up the tables to capture the selection criteria. One additional criterion, though, is that they may want to get a certain percentage of each item. For example, "give me 70% U.S. users and 30% Canada users." Again, I think that I can do this without too much trouble. They will give the number of users that they want, so I can just multiply by the percentages, then make sure that the numbers still add up after rounding, and I'm good to go.
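One common way to make the rounded numbers add up is the largest-remainder method: round every share down, then give the leftover units to the groups that lost the most to rounding. The sketch below is only an illustration; the function name `allocate_quotas` is hypothetical, and it assumes the percentages are whole numbers that sum to 100.

```python
def allocate_quotas(total, percentages):
    """Split `total` into integer quotas proportional to `percentages`.

    Assumes integer percentages that sum to 100 (an assumption of this sketch).
    """
    # divmod keeps the arithmetic exact: (whole users, remainder out of 100).
    shares = {key: divmod(total * pct, 100) for key, pct in percentages.items()}
    quotas = {key: whole for key, (whole, _) in shares.items()}
    leftover = total - sum(quotas.values())
    # Hand leftover units to the groups with the largest remainders first.
    by_remainder = sorted(shares, key=lambda key: shares[key][1], reverse=True)
    for key in by_remainder[:leftover]:
        quotas[key] += 1
    return quotas


print(allocate_quotas(10, {"US": 70, "CA": 30}))  # {'US': 7, 'CA': 3}
```

When two groups tie on the remainder, this sketch favors whichever was listed first; a different tie-break rule may suit the real requirements better.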

Thinking to the future though, what if they wanted certain percentage breakdowns by two sets of criteria? For example, "Give me 70% U.S., 30% Canada and at the same time, 50% level 2 users and 50% level 3 users." Since it's not a current requirement I'm not planning to give myself a headache over it, but if anyone has a reasonably simple algorithm (or SQL code) for accomplishing something like this then I'd be happy to see it.

Although I would prefer a DB-agnostic solution, I'm on MS SQL 2005, so solutions specific to that RDBMS are fine too.

The table structure which I'm currently using is similar to this:

CREATE TABLE Selection_Templates
(
     template_code     VARCHAR(20)     NOT NULL,
     template_name     VARCHAR(100)    NOT NULL,
     CONSTRAINT PK_Selection_Templates PRIMARY KEY CLUSTERED (template_code),
     CONSTRAINT UI_Selection_Templates UNIQUE (template_name)
)
GO
CREATE TABLE Selection_Template_Countries
(
     template_code            VARCHAR(20)       NOT NULL,
     country_code             CHAR(3)           NOT NULL,
     selection_percentage     DECIMAL(3, 2)     NULL,   -- DECIMAL(2, 2) tops out at 0.99 and cannot store 1.00
     CONSTRAINT PK_Selection_Template_Countries PRIMARY KEY CLUSTERED (template_code, country_code),
     CONSTRAINT CK_Selection_Template_Countries_selection_percentage CHECK (selection_percentage > 0),
     CONSTRAINT FK_Selection_Template_Countries_Selection_Template FOREIGN KEY (template_code) REFERENCES Selection_Templates (template_code)
)
GO
CREATE TABLE Selection_Template_User_Levels
(
     template_code            VARCHAR(20)       NOT NULL,
     user_level               SMALLINT          NOT NULL,
     selection_percentage     DECIMAL(3, 2)     NULL,   -- DECIMAL(2, 2) tops out at 0.99 and cannot store 1.00
     CONSTRAINT PK_Selection_Template_User_Levels PRIMARY KEY CLUSTERED (template_code, user_level),
     CONSTRAINT CK_Selection_Template_User_Levels_selection_percentage CHECK (selection_percentage > 0),
     CONSTRAINT FK_Selection_Template_User_Levels_Selection_Template FOREIGN KEY (template_code) REFERENCES Selection_Templates (template_code)
)

Solution

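You could break the problem down into four sets of random users, one per combination of country and level, sizing each set by multiplying the two percentages (for example, 70% × 50% = 35%):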

US users, level 2, choose 35% of total sample desired
Canada users, level 2, choose 15% of total sample desired
US users, level 3, choose 35% of total sample desired
Canada users, level 3, choose 15% of total sample desired

If there's a third criterion with two values, split the problem into eight sets. And so on.
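The breakdown into sets generalizes to any number of criteria by taking every combination of values and multiplying the matching fractions. A minimal sketch, assuming the criteria arrive as a nested dictionary (the `criteria` layout and the name `cell_fractions` are made up for illustration):

```python
from itertools import product
from math import prod


def cell_fractions(criteria):
    """Map each combination of criterion values to the product of its fractions."""
    names = list(criteria)
    cells = {}
    # Each combo holds one (value, fraction) pair per criterion.
    for combo in product(*(criteria[name].items() for name in names)):
        values = tuple(value for value, _ in combo)
        cells[values] = prod(frac for _, frac in combo)
    return cells


criteria = {
    "country": {"US": 0.7, "CA": 0.3},
    "level": {2: 0.5, 3: 0.5},
}
for cell, frac in cell_fractions(criteria).items():
    print(cell, round(frac, 4))
```

Multiplying the fractions is only one way to satisfy both breakdowns at once; it forces the level split to be identical within every country, which the request itself does not require.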

It may seem artificial to get exactly 50% level 2 and 50% level 3 in both sets of users, US and Canada. Since it's supposed to be random, you might expect it to vary a bit more. Plus what if there aren't very many level 3 users from Canada to make up 15% of the total?

As the criteria get more and more selective, you're naturally taking away from the randomness of the total sample. Eventually you could have a long list of criteria such that only one subset of your users could satisfy it, and then there'd be no randomness at all.
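To make the shortage scenario above concrete, here is a rough sketch of drawing each set independently and reporting any set that cannot be filled. It assumes users are plain dictionaries with `country` and `level` keys and that per-set quotas have already been computed; those names and the function `stratified_sample` are illustrative, not part of the question's schema.

```python
import random
from collections import defaultdict


def stratified_sample(users, quotas, rng=random):
    """Draw quotas[cell] users at random from each (country, level) cell.

    Returns the chosen users and a dict mapping each under-filled cell
    to the number of users it fell short by.
    """
    by_cell = defaultdict(list)
    for user in users:
        by_cell[(user["country"], user["level"])].append(user)

    chosen, shortfalls = [], {}
    for cell, wanted in quotas.items():
        pool = by_cell.get(cell, [])
        take = min(wanted, len(pool))  # never ask for more than exist
        chosen.extend(rng.sample(pool, take))
        if take < wanted:
            shortfalls[cell] = wanted - take
    return chosen, shortfalls
```

What to do with a shortfall (shrink the whole sample, or top up from another set and accept skewed percentages) is a policy decision that the code cannot make on its own.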

Re your comment: Right, SQL isn't the best solution for every type of problem. You may be better off handling the problem with an iterative algorithm instead of a single set-based SQL query. For example:

1. Pick one random row.
2. If the row has been chosen already in a previous iteration, discard it.
3. If the row helps keep the pace of choosing a total sample that is 70% US, 30% Canada, 50% level 2, 50% level 3, keep it. Otherwise, discard it.
4. If you reach the desired number of samples, stop.
5. Loop back to step 1.

Of course, it gets tricky if you pick a row that helps to balance the 70/30% ratio of nations, but imbalances the 50/50% ratio of levels. Do you discard it or not? And also you may want to ignore the ratios when you've only picked the first few rows.
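One simple way to settle the keep-or-discard question is to treat the final counts as hard caps and keep a row only while both its country and its level still need more users. This is a simplification of the "keep the pace" idea rather than an implementation of it, and a sketch only: the dictionary keys `country` and `level`, the quota dictionaries, and the name `greedy_marginal_sample` are all assumptions.

```python
import random


def greedy_marginal_sample(users, country_quota, level_quota, rng=random):
    """One shuffled pass: keep a user only while both of their groups need more.

    Assumes both quota dicts sum to the same desired sample size.
    """
    country_left = dict(country_quota)
    level_left = dict(level_quota)
    pool = list(users)
    rng.shuffle(pool)  # visiting a shuffled copy never picks the same row twice

    chosen = []
    for user in pool:
        if not any(country_left.values()):
            break  # all country quotas are filled, so the sample is complete
        country, level = user["country"], user["level"]
        if country_left.get(country, 0) > 0 and level_left.get(level, 0) > 0:
            chosen.append(user)
            country_left[country] -= 1
            level_left[level] -= 1
    return chosen
```

Because the pass is greedy, it can come back with fewer rows than requested when some combination is scarce, even if a different selection would have satisfied every quota. Unlike fixed per-set sizes, though, it lets the level split differ between countries.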

As @Hogan commented, this might be an NP-complete problem, for which no efficient exact algorithm is known. But many such problems have a solution that gives you a "good enough" result, though not a provably optimal result.

Licensed under: CC-BY-SA with attribution
