Question

I am making all pairwise comparisons in a dataset. The use-case is collapsing records into a unique ID based on fuzzy names and dates of birth. The size of the database is around 57,000 individuals. So this is a total of 57,000 choose 2 pairwise combinations. (This is a tiny example I know, but I have other databases with the same problem that are much larger.)

Based on other analysis, I concluded that I do not want to consider people with birthdates of more than three years apart. So I can subset the database into overlapping cohorts, and then only do all pair-wise comparisons within each cohort.

It is easy to show examples of where this cohort approach will reduce the number of pairs I need to compare. Here is one example with my data just based on the quintiles of the year of birth (and those with missing birthdays go into all cohorts).

(Min,1968]  : 13,453
[1962,1980] : 17,335
[1974,1988] : 21,188
[1982,1993] : 21,993
[1987,Max)  : 17,449

This saves me around 0.7 billion comparisons.

So this brings up two questions:

  • is choosing the bins based on quantiles a good strategy, or is there another strategy that works better?
  • how many bins should I make?

Solution

There is a better way if I understand your question correctly. Here is the algorithm I propose:

  • Initialize a 'window' list and a 'pairs' list
  • Sort your data on birthday from old to young (or the other way around)
  • Loop over the sorted rows, keeping a 'window' of all rows whose birthdate is within three years of the current row. For each new row: drop the rows in the window that are now more than three years older, add the current row paired with every remaining window row to your 'pairs' list, and then add the current row to the window.

This means you iterate over your main list only once, and if you implement the 'window' properly (as a linked list or deque, for example) the window bookkeeping is cheap as well. As a by-product you get each pair exactly once. You also get all the pairs, whereas with a non-overlapping binning approach you miss pairs that straddle the bin edges.
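A minimal sketch of that sliding-window pass, assuming records are `(id, birthdate)` tuples; the names and the fuzzy comparison itself are left out, and a `deque` plays the role of the linked list (rows leave the window in the same sorted order they entered, so only the front ever needs to be popped):

```python
from collections import deque
from datetime import date

def candidate_pairs(records, max_gap_days=3 * 365):
    """Yield every pair of records whose birthdates differ by at most
    max_gap_days, iterating over the sorted records only once."""
    records = sorted(records, key=lambda r: r[1])  # oldest first
    window = deque()  # rows still within max_gap_days of the current row
    for row in records:
        # Rows fall out of range in insertion order, so pop from the front.
        while window and (row[1] - window[0][1]).days > max_gap_days:
            window.popleft()
        for other in window:
            yield (other, row)
        window.append(row)

people = [
    ("a", date(1960, 1, 1)),
    ("b", date(1961, 6, 1)),
    ("c", date(1970, 1, 1)),
    ("d", date(1971, 1, 1)),
]
pairs = list(candidate_pairs(people))
# Only (a, b) and (c, d) survive; all other pairs are >3 years apart.
```

Each record enters and leaves the window once, so the work beyond emitting the pairs themselves is O(n log n) for the sort plus O(n) for the sweep.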

If you do use the overlapping binning approach, I think you should make the bins 6 years wide with the borders shifted 3 years each time; that way every true pair lands in at least one bin together, and the number of unwanted pairs is minimized.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange