Pairwise intersections

https://dba.stackexchange.com/questions/103961

vertica

26-09-2020
|

Question

I have a table with two columns, let's say FirstName and LastName. I need to get another table, which for every pair of FirstName's from the first one contains a count of the common LastName's.

Is this even feasible to do in SQL?

There are much more unique LastName's than FirstName's, if this affects efficiency of the query.

A toy example, input:

FirstName, LastName
John, Smith
John, Doe
Jane, Doe

Output:

FirstName1, FirstName2, CommonLastNames
John, John, 2
John, Jane, 1
Jane, Jane, 1
Jane, John, 1

As this relation is reflexive and symmetrical, it's OK if the result is just one of the triangles (e.g, the one above the diagonal).

Solution

I'm going to use MS SQL Server to do this since I have a copy to hand. I believe most any of the majors would do it similarly.

First a sample table, with data. I use a table variable but it's the same for any flavour of table.

declare @t table (FirstName char(10), LastName char(10));

insert @t(FirstName,LastName)
values ('John','Smith'),('John','Doe'),('Jane','Doe');

You can get all pairs by doing a self-join:

select
    a.FirstName, a.LastName, b.FirstName, b.LastName
from @t as a
cross apply @t as b;

Using CROSS APPLY avoids having to jump through hoops to find a join condition for an ON clause.

Next you need something to count. This is where the CASE statement comes in. The case returns an integer value per pair of first names, which is what gets counted. (If I'm reading your question correctly you want where the LastNames match so that's the comparison I have. Hopefully it's obvious how to modify this if I'm wrong.)

select
    ...
    case
        when a.LastName = b.LastName then 1
        else 0
    end
...etc.

Add in a SUM() and GROUP BY and you get your answer:

select
    a.FirstName,
    b.FirstName,
    sum(
    case
        when a.LastName = b.LastName then 1
        else 0
    end
    ) as CommonLastNames
from @t as a
cross apply @t as b
group by a.FirstName, b.FirstName;

OTHER TIPS

I have to admit my question was a bit flawed. What I really needed was not "for every pair of FirstName's from the first one contains a count of the common LastName's". In fact, I do not care about pairs with zero counts.

When the question is corrected, the solution becomes much faster.

Given the input:

create local temp table t (FirstName char(10), LastName char(10)) ON COMMIT PRESERVE ROWS;
insert into t(FirstName,LastName) values ('John','Smith');
insert into t(FirstName,LastName) values ('John','Doe');
insert into t(FirstName,LastName) values ('Jane','Doe');

For the original question, the solution is O(N^2) (because the question insists on "every pair"):

select a.FirstName, b.FirstName, 
  sum(case when a.LastName = b.LastName then 1 else 0 end) CommonNames 
  from t a, t b group by 1, 2;

If it's OK to skip zero counts, then a self join on LastName works much faster (assuming data is sufficiently sparse):

select a.FirstName, b.FirstName,
  count(*) CommonNames from t a
  join t b using (LastName) group by 1, 2;

I still wonder how I missed this trivial solution.

Doh! Here's a better way:

SELECT city_a, city_b, COUNT(*)
    FROM (
        SELECT a.city city_a,
               a.state,
               b.city city_b
        FROM       us a
        CROSS JOIN us b
        WHERE a.state = b.state
          AND a.city < b.city
         ) x
    GROUP BY city_a, city_b
    ORDER BY 3 DESC;

Output:

+-----------+-------------+----------+
| city_a    | city_b      | COUNT(*) |
+-----------+-------------+----------+
| Lebanon   | Springfield |        5 |
| Bedford   | Franklin    |        4 |  -- as shown in previous 'answer'
| Franklin  | Lebanon     |        4 |
| Franklin  | Hudson      |        4 |
| Franklin  | Salem       |        4 |
| Hudson    | Salem       |        4 |
| Salem     | Springfield |        4 |
| Clinton   | Columbia    |        4 |
| Auburn    | Fairfield   |        3 |
| Auburn    | Madison     |        3 |
...
(2.63 sec) -- for all 4175 cities in `us`.

Sanity check on first item:

mysql> SELECT city, state FROM us WHERE city IN ('Lebanon', 'Springfield');
+-------------+-------+
| city        | state |
+-------------+-------+
| Springfield | FL    |
| Springfield | IL    |
| Lebanon     | IN    |
| Springfield | MA    |
| Lebanon     | ME    |
| Lebanon     | MO    |
| Springfield | MO    |
| Lebanon     | NH    |
| Springfield | NJ    |
| Lebanon     | OH    |
| Springfield | OH    |
| Lebanon     | OR    |
| Springfield | OR    |
| Lebanon     | PA    |
| Springfield | PA    |
| Lebanon     | TN    |
| Springfield | TN    |
| Springfield | VA    |
| Springfield | VT    |
+-------------+-------+
19 rows in set (0.00 sec)

The main Handler% STATUS values show that it did a lot of work, but not quite O(N*N) (probably because the CROSS JOIN is only one state at a time):

| Handler_read_key           | 4176   |
| Handler_read_next          | 667294 |
| Handler_read_rnd           | 1742   |
| Handler_read_rnd_next      | 701964 |
| Handler_update             | 1731   |
| Handler_write              | 703693 |

Extrapolating to millions of rows -- It will probably take days.

That was an interesting challenge. Using a list of US cities, I came up with this solution (in MySQL):

SELECT  city_a, city_b,
        COUNT(DISTINCT state)
    FROM (
        ( SELECT a.city city_a,
                 b.city city_b,
                 a.state            -- This line differs
            FROM       us a
            CROSS JOIN us b
            WHERE a.state = b.state
              AND a.city != b.city   -- Added (to avoid noise)
              AND a.city < 'M'    -- to speed up test
              AND b.city < 'M'
        )
        UNION ALL
        ( SELECT a.city city_a,
                 b.city city_b,
                 b.state            -- This line differs
            FROM       us a
            CROSS JOIN us b
            WHERE a.state = b.state
              AND a.city != b.city   -- Added (to avoid noise)
              AND a.city < 'M'    -- to speed up test
              AND b.city < 'M'
        )
        ) ab
    GROUP BY 1, 2
    HAVING   COUNT(DISTINCT state) > 1
    ORDER BY COUNT(DISTINCT state) desc

INDEX(state, city) helps with performance.

The results:

+----------+------------+-----------------------+
| city_a   | city_b     | COUNT(DISTINCT state) |
+----------+------------+-----------------------+
| Franklin | Bedford    |                     4 |
| Lebanon  | Franklin   |                     4 |
| Franklin | Lebanon    |                     4 |
| Hudson   | Franklin   |                     4 |
| Columbia | Clinton    |                     4 |
| Clinton  | Columbia   |                     4 |
| Franklin | Hudson     |                     4 |
| Bedford  | Franklin   |                     4 |
| Lebanon  | Farmington |                     3 |
| Hanover  | Kingston   |                     3 |
...
(25.17 sec)

It might have taken 4 times as long to include the entire alphabet. There were only 4K rows in the table, so this is not a fast task.

"Proof" of the results: mysql> SELECT city, state FROM us WHERE city IN ('Franklin', 'Bedford');

+----------+-------+
| city     | state |
+----------+-------+
| Bedford  | IN    |
| Franklin | IN    |
| Bedford  | MA    |
| Franklin | MA    |
| Bedford  | NH    |
| Franklin | NH    |
| Bedford  | OH    |
| Franklin | OH    |
| Franklin | TN    |
| Bedford  | TX    |
| Franklin | WI    |
+----------+-------+
11 rows in set (0.00 sec)

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange