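I have a table with two columns, say FirstName and LastName. I need another table that, for every pair of first names from the first table, contains the count of last names they have in common.

Is this feasible in SQL?

In case it affects the efficiency of the query: there are far more distinct last names than distinct first names.

A toy example, input: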

FirstName, LastName
John, Smith
John, Doe
Jane, Doe

Output:

FirstName1, FirstName2, CommonLastNames
John, John, 2
John, Jane, 1
Jane, Jane, 1
Jane, John, 1

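Since this relation is reflexive and symmetric, it is fine if the result contains only one of the triangles (for example, the one above the diagonal).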


Solution

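I'll use MS SQL Server for this since I have a copy to hand. I believe most of the major DBMSs would handle it in much the same way.

First, a sample table with data. I use a table variable, but it works the same for any kind of table.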

declare @t table (FirstName char(10), LastName char(10));

insert @t(FirstName,LastName)
values ('John','Smith'),('John','Doe'),('Jane','Doe');

You can get all pairs with a self-join:

select
    a.FirstName, a.LastName, b.FirstName, b.LastName
from @t as a
cross apply @t as b;

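Using CROSS APPLY avoids the trouble of finding something to join on for an ON clause. (A plain CROSS JOIN produces the same Cartesian product and is the more portable spelling.)

Next you need something to count. This is where the CASE expression comes in. The CASE returns an integer for each pair of first names, and that is the value that gets counted. (If I read your question correctly, you want the rows where the last names match, so that is my comparison. If I got that wrong, I hope it is obvious how to modify it.)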

select
    ...
    case
        when a.LastName = b.LastName then 1
        else 0
    end
...etc.

Add a SUM() and a GROUP BY and you have your answer:

select
    a.FirstName,
    b.FirstName,
    sum(
    case
        when a.LastName = b.LastName then 1
        else 0
    end
    ) as CommonLastNames
from @t as a
cross apply @t as b
group by a.FirstName, b.FirstName;
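
For readers without SQL Server at hand, the same query can be tried in SQLite through Python's built-in sqlite3 module. This is only a minimal sketch, not the answer's own code: the table name `people` is an assumption, and CROSS JOIN stands in for CROSS APPLY, which SQLite does not support.

```python
import sqlite3

# Assumption: an in-memory SQLite table named "people" stands in for @t.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO people (FirstName, LastName) VALUES (?, ?)",
    [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe")],
)

# Every row is paired with every row; a matching last name adds 1 to the sum.
query = """
    SELECT a.FirstName AS FirstName1,
           b.FirstName AS FirstName2,
           SUM(CASE WHEN a.LastName = b.LastName THEN 1 ELSE 0 END) AS CommonLastNames
    FROM people AS a
    CROSS JOIN people AS b
    GROUP BY a.FirstName, b.FirstName
    ORDER BY a.FirstName, b.FirstName
"""
all_pairs = conn.execute(query).fetchall()
for row in all_pairs:
    print(row)
```

On the toy data this prints the four rows from the question: Jane/Jane 1, Jane/John 1, John/Jane 1 and John/John 2.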

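Other tips

I have to admit my question was somewhat flawed. What I really need is not "for every pair of first names, the count of common last names". In fact, I don't care about pairs with a zero count.

Once the question is corrected, the solution becomes much faster.

Given the input: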

create local temp table t (FirstName char(10), LastName char(10)) ON COMMIT PRESERVE ROWS;
insert into t(FirstName,LastName) values ('John','Smith');
insert into t(FirstName,LastName) values ('John','Doe');
insert into t(FirstName,LastName) values ('Jane','Doe');

For the original question, the solution is O(n^2) (because the question asks for "every pair"):

select a.FirstName, b.FirstName, 
  sum(case when a.LastName = b.LastName then 1 else 0 end) CommonNames 
  from t a, t b group by 1, 2;

If zero counts can be skipped, a self-join on LastName is much faster (assuming the data is sparse enough):

select a.FirstName, b.FirstName,
  count(*) CommonNames from t a
  join t b using (LastName) group by 1, 2;

I still wonder how I missed this trivial solution.
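
The join on LastName can also be restricted to one triangle of the symmetric result, as the question allows. The sketch below is an illustration under assumptions: it runs in SQLite via Python's sqlite3 module rather than the DBMS used above, spells the join with ON instead of USING, and the `a.FirstName <= b.FirstName` filter is an addition that is not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO t (FirstName, LastName) VALUES (?, ?)",
    [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe")],
)

# Joining on LastName only generates pairs that share a last name,
# so pairs with a zero count never appear in the result.
# The WHERE filter (an addition) keeps a single triangle, diagonal included.
triangle = conn.execute(
    """
    SELECT a.FirstName, b.FirstName, COUNT(*) AS CommonNames
    FROM t AS a
    JOIN t AS b ON a.LastName = b.LastName
    WHERE a.FirstName <= b.FirstName
    GROUP BY a.FirstName, b.FirstName
    ORDER BY a.FirstName, b.FirstName
    """
).fetchall()
print(triangle)
```

Changing `<=` to `<` would also drop the diagonal (each first name paired with itself).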

D'oh! Here is a much better approach:

SELECT city_a, city_b, COUNT(*)
    FROM (
        SELECT a.city city_a,
               a.state,
               b.city city_b
        FROM       us a
        CROSS JOIN us b
        WHERE a.state = b.state
          AND a.city < b.city
         ) x
    GROUP BY city_a, city_b
    ORDER BY 3 DESC;

Output:

+-----------+-------------+----------+
| city_a    | city_b      | COUNT(*) |
+-----------+-------------+----------+
| Lebanon   | Springfield |        5 |
| Bedford   | Franklin    |        4 |  -- as shown in previous 'answer'
| Franklin  | Lebanon     |        4 |
| Franklin  | Hudson      |        4 |
| Franklin  | Salem       |        4 |
| Hudson    | Salem       |        4 |
| Salem     | Springfield |        4 |
| Clinton   | Columbia    |        4 |
| Auburn    | Fairfield   |        3 |
| Auburn    | Madison     |        3 |
...
(2.63 sec) -- for all 4175 cities in `us`.

Sanity check of the first item:

mysql> SELECT city, state FROM us WHERE city IN ('Lebanon', 'Springfield');
+-------------+-------+
| city        | state |
+-------------+-------+
| Springfield | FL    |
| Springfield | IL    |
| Lebanon     | IN    |
| Springfield | MA    |
| Lebanon     | ME    |
| Lebanon     | MO    |
| Springfield | MO    |
| Lebanon     | NH    |
| Springfield | NJ    |
| Lebanon     | OH    |
| Springfield | OH    |
| Lebanon     | OR    |
| Springfield | OR    |
| Lebanon     | PA    |
| Springfield | PA    |
| Lebanon     | TN    |
| Springfield | TN    |
| Springfield | VA    |
| Springfield | VT    |
+-------------+-------+
19 rows in set (0.00 sec)
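
The same check can be reproduced without the full `us` table. The sketch below loads only the 19 rows listed in the sanity check into an in-memory SQLite database (SQLite and the column alias `shared_states` are assumptions; the original ran on MySQL) and runs the pair query against them.

```python
import sqlite3

# Rows copied from the sanity-check output above (these two cities only).
rows = [
    ("Springfield", "FL"), ("Springfield", "IL"), ("Lebanon", "IN"),
    ("Springfield", "MA"), ("Lebanon", "ME"), ("Lebanon", "MO"),
    ("Springfield", "MO"), ("Lebanon", "NH"), ("Springfield", "NJ"),
    ("Lebanon", "OH"), ("Springfield", "OH"), ("Lebanon", "OR"),
    ("Springfield", "OR"), ("Lebanon", "PA"), ("Springfield", "PA"),
    ("Lebanon", "TN"), ("Springfield", "TN"), ("Springfield", "VA"),
    ("Springfield", "VT"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE us (city TEXT, state TEXT)")
conn.executemany("INSERT INTO us (city, state) VALUES (?, ?)", rows)

pairs = conn.execute(
    """
    SELECT a.city AS city_a, b.city AS city_b, COUNT(*) AS shared_states
    FROM us AS a
    JOIN us AS b ON a.state = b.state
    WHERE a.city < b.city   -- keep each unordered pair once
    GROUP BY a.city, b.city
    ORDER BY shared_states DESC
    """
).fetchall()
print(pairs)  # [('Lebanon', 'Springfield', 5)]
```

The five shared states are MO, OH, OR, PA and TN, which matches the first row of the output above.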

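The main Handler% STATUS values show that it did a lot of work, but not quite O(N*N) (probably because the cross join only works within one state at a time):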

| Handler_read_key           | 4176   |
| Handler_read_next          | 667294 |
| Handler_read_rnd           | 1742   |
| Handler_read_rnd_next      | 701964 |
| Handler_update             | 1731   |
| Handler_write              | 703693 |

Extrapolating to millions of rows, it would probably take days.
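
One way to see why the work stays below N*N is to count the rows the join has to produce: each state contributes the square of its own city count, not the square of the whole table. The sketch below uses hypothetical, evenly distributed data (100 rows over 4 made-up state codes); the helper name `join_row_count` is also an assumption, and the real `us` table is certainly less uniform.

```python
from collections import Counter

def join_row_count(states):
    """Rows produced by joining a table to itself on state, before other filters."""
    sizes = Counter(states)
    return sum(k * k for k in sizes.values())

# Hypothetical data: 100 rows spread evenly over 4 made-up states.
states = ["AA", "BB", "CC", "DD"] * 25

within_state = join_row_count(states)  # 4 groups * 25**2 rows each
full_cross = len(states) ** 2          # unrestricted cross join
print(within_state, full_cross)        # 2500 10000
```

With G equally sized groups the join does roughly 1/G of the work of a full cross join, but it still grows quadratically inside each group, which is consistent with the pessimistic extrapolation above.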

This is an interesting challenge. Using a list of US cities, I came up with this solution (in MySQL):

SELECT  city_a, city_b,
        COUNT(DISTINCT state)
    FROM (
        ( SELECT a.city city_a,
                 b.city city_b,
                 a.state            -- This line differs
            FROM       us a
            CROSS JOIN us b
            WHERE a.state = b.state
              AND a.city != b.city   -- Added (to avoid noise)
              AND a.city < 'M'    -- to speed up test
              AND b.city < 'M'
        )
        UNION ALL
        ( SELECT a.city city_a,
                 b.city city_b,
                 b.state            -- This line differs
            FROM       us a
            CROSS JOIN us b
            WHERE a.state = b.state
              AND a.city != b.city   -- Added (to avoid noise)
              AND a.city < 'M'    -- to speed up test
              AND b.city < 'M'
        )
        ) ab
    GROUP BY 1, 2
    HAVING   COUNT(DISTINCT state) > 1
    ORDER BY COUNT(DISTINCT state) DESC;

INDEX(state, city) helps performance.

Results:
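
As a rough illustration of the composite index, the sketch below creates one in SQLite and prints the query plan. Everything here is an assumption made for demonstration: the rows are made up, the index name `idx_state_city` is hypothetical, and SQLite's planner is not MySQL's, so the plan only hints at what MySQL might do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE us (city TEXT, state TEXT)")
# Hypothetical rows, just enough to run the query.
conn.executemany(
    "INSERT INTO us (city, state) VALUES (?, ?)",
    [("Alpha", "AA"), ("Beta", "AA"),
     ("Alpha", "BB"), ("Beta", "BB"), ("Gamma", "BB")],
)

query = """
    SELECT a.city, b.city, COUNT(*)
    FROM us AS a
    JOIN us AS b ON a.state = b.state
    WHERE a.city < b.city
    GROUP BY a.city, b.city
    ORDER BY a.city, b.city
"""
before = conn.execute(query).fetchall()

# The composite index mentioned above; the name is an assumption.
conn.execute("CREATE INDEX idx_state_city ON us (state, city)")
after = conn.execute(query).fetchall()

# Print SQLite's plan so you can see whether the index is picked up.
plan = [step[-1] for step in conn.execute("EXPLAIN QUERY PLAN " + query)]
for detail in plan:
    print(detail)
```

The index changes only how the rows are found, not which rows come back, so `before` and `after` hold the same result.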

+----------+------------+-----------------------+
| city_a   | city_b     | COUNT(DISTINCT state) |
+----------+------------+-----------------------+
| Franklin | Bedford    |                     4 |
| Lebanon  | Franklin   |                     4 |
| Franklin | Lebanon    |                     4 |
| Hudson   | Franklin   |                     4 |
| Columbia | Clinton    |                     4 |
| Clinton  | Columbia   |                     4 |
| Franklin | Hudson     |                     4 |
| Bedford  | Franklin   |                     4 |
| Lebanon  | Farmington |                     3 |
| Hanover  | Kingston   |                     3 |
...
(25.17 sec)

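It would probably have taken about 4 times as long to cover the entire alphabet. There are only 4K rows in the table, so this is not a fast task.

"Proof" of the result:

mysql> SELECT city, state FROM us WHERE city IN ('Franklin', 'Bedford');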
+----------+-------+
| city     | state |
+----------+-------+
| Bedford  | IN    |
| Franklin | IN    |
| Bedford  | MA    |
| Franklin | MA    |
| Bedford  | NH    |
| Franklin | NH    |
| Bedford  | OH    |
| Franklin | OH    |
| Franklin | TN    |
| Bedford  | TX    |
| Franklin | WI    |
+----------+-------+
11 rows in set (0.00 sec)

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange