Question
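I have a table with two columns, say FirstName and LastName. I need another table that, for each pair of FirstNames from the first table, contains the count of LastNames they have in common.
Is this feasible in SQL?
There are many more unique LastNames than unique FirstNames, if that affects the efficiency of the query.
A toy example, input: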
FirstName, LastName
John, Smith
John, Doe
Jane, Doe
Output:
FirstName1, FirstName2, CommonLastNames
John, John, 2
John, Jane, 1
Jane, Jane, 1
Jane, John, 1
Since this relation is reflexive and symmetric, it is fine if the result is just one of the triangles (for example, the one above the diagonal).
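Solution
I will use MS SQL Server for this because I have a copy to hand. I believe most of the major DBMSs would handle it similarly.
First, a sample table with data. I use a table variable, but it is the same for any kind of table.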
declare @t table (FirstName char(10), LastName char(10));
insert @t(FirstName,LastName)
values ('John','Smith'),('John','Doe'),('Jane','Doe');
You can get all the pairs with a self-join:
select
a.FirstName, a.LastName, b.FirstName, b.LastName
from @t as a
cross apply @t as b;
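Using CROSS APPLY avoids the fuss of finding a join condition for an ON clause. (Because the right-hand side does not reference the left, it behaves like a plain CROSS JOIN here.)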
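To see what the pairing step produces, here is a minimal sketch that runs the same idea from Python against an in-memory SQLite database. SQLite and the plain table name t are my assumptions, standing in for SQL Server and the table variable @t; CROSS JOIN is used because SQLite has no CROSS APPLY.

```python
import sqlite3

# In-memory SQLite table standing in for the table variable @t (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO t (FirstName, LastName) VALUES (?, ?)",
    [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe")],
)

# CROSS JOIN pairs every row with every row, including itself.
pairs = conn.execute(
    """
    SELECT a.FirstName, a.LastName, b.FirstName, b.LastName
    FROM t AS a
    CROSS JOIN t AS b
    """
).fetchall()

for row in pairs:
    print(row)
print(len(pairs))  # 3 rows x 3 rows = 9 pairs
```

Three input rows give nine row pairs, which is where the quadratic cost of the "every pair" formulation comes from.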
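Next you need something to count. This is where the CASE expression comes in. The CASE returns an integer value for each pair of first names, and that is what gets counted. (If I read your question correctly, you want the pairs where the LastNames match, so that is the comparison I use. Hopefully it is obvious how to modify it if I am wrong.)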
select
...
case
when a.LastName = b.LastName then 1
else 0
end
...etc.
Add a SUM() and a GROUP BY and you get the answer:
select
a.FirstName,
b.FirstName,
sum(
case
when a.LastName = b.LastName then 1
else 0
end
) as CommonLastNames
from @t as a
cross apply @t as b
group by a.FirstName, b.FirstName;
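As a quick check against the toy example in the question, the following sketch runs the same SUM/CASE/GROUP BY query from Python on an in-memory SQLite database. SQLite, the table name t, and CROSS JOIN in place of CROSS APPLY are assumptions made only so the example is runnable anywhere.

```python
import sqlite3

# In-memory SQLite table standing in for the table variable @t (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO t (FirstName, LastName) VALUES (?, ?)",
    [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe")],
)

rows = conn.execute(
    """
    SELECT a.FirstName,
           b.FirstName,
           SUM(CASE WHEN a.LastName = b.LastName THEN 1 ELSE 0 END) AS CommonLastNames
    FROM t AS a
    CROSS JOIN t AS b
    GROUP BY a.FirstName, b.FirstName
    """
).fetchall()

# Map each (FirstName1, FirstName2) pair to its count of shared LastNames.
counts = {(first1, first2): n for first1, first2, n in rows}
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The four pairs come out with the counts from the question: 2 for (John, John) and 1 for each of the other three.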
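Other tips
I must admit that my question was a bit flawed. What I really needed was not a count of common LastNames "for each pair of FirstNames". In fact, I do not care about pairs with a zero count.
Once the question is corrected, the solution becomes much faster.
Given the input: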
create local temp table t (FirstName char(10), LastName char(10)) ON COMMIT PRESERVE ROWS;
insert into t(FirstName,LastName) values ('John','Smith');
insert into t(FirstName,LastName) values ('John','Doe');
insert into t(FirstName,LastName) values ('Jane','Doe');
For the original question, the solution is O(n^2) (because the question asks for "every pair"):
select a.FirstName, b.FirstName,
sum(case when a.LastName = b.LastName then 1 else 0 end) CommonNames
from t a, t b group by 1, 2;
If the zero counts can be skipped, a self-join on LastName is faster (assuming the data is sparse enough):
select a.FirstName, b.FirstName,
count(*) CommonNames from t a
join t b using (LastName) group by 1, 2;
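To make the difference between the two queries concrete, here is a small sketch that runs both from Python on an in-memory SQLite database and compares them. SQLite is an assumption (the statements above use a different dialect), and the extra row ('Mary', 'Jones') is made up so that some pairs have a zero count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO t (FirstName, LastName) VALUES (?, ?)",
    # ('Mary', 'Jones') is a hypothetical extra row that shares no LastName.
    [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe"), ("Mary", "Jones")],
)

# Every pair of FirstNames, zero counts included (the O(n^2) form).
all_pairs = {
    (first1, first2): n
    for first1, first2, n in conn.execute(
        """
        SELECT a.FirstName, b.FirstName,
               SUM(CASE WHEN a.LastName = b.LastName THEN 1 ELSE 0 END)
        FROM t AS a CROSS JOIN t AS b
        GROUP BY a.FirstName, b.FirstName
        """
    )
}

# Only pairs that share at least one LastName (the equi-join form).
shared_only = {
    (first1, first2): n
    for first1, first2, n in conn.execute(
        """
        SELECT a.FirstName, b.FirstName, COUNT(*)
        FROM t AS a JOIN t AS b USING (LastName)
        GROUP BY a.FirstName, b.FirstName
        """
    )
}

nonzero = {pair: n for pair, n in all_pairs.items() if n > 0}
dropped = sorted(set(all_pairs) - set(shared_only))
print(shared_only == nonzero)
print(dropped)
```

The join on LastName returns the same counts as the cross join wherever the count is non-zero; it just never materialises the pairs that have nothing in common.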
I still wonder how I missed this trivial solution.
D'oh! Here is a much better approach:
SELECT city_a, city_b, COUNT(*)
FROM (
SELECT a.city city_a,
a.state,
b.city city_b
FROM us a
CROSS JOIN us b
WHERE a.state = b.state
AND a.city < b.city
) x
GROUP BY city_a, city_b
ORDER BY 3 DESC;
Output:
+-----------+-------------+----------+
| city_a | city_b | COUNT(*) |
+-----------+-------------+----------+
| Lebanon | Springfield | 5 |
| Bedford | Franklin | 4 | -- as shown in previous 'answer'
| Franklin | Lebanon | 4 |
| Franklin | Hudson | 4 |
| Franklin | Salem | 4 |
| Hudson | Salem | 4 |
| Salem | Springfield | 4 |
| Clinton | Columbia | 4 |
| Auburn | Fairfield | 3 |
| Auburn | Madison | 3 |
...
(2.63 sec) -- for all 4175 cities in `us`.
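The a.city < b.city condition is what keeps only one triangle of the symmetric result. A minimal sketch of that, run from Python on an in-memory SQLite database: SQLite is an assumption (the timing above is from MySQL), and the five rows in us are made-up sample data, not the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE us (city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO us (city, state) VALUES (?, ?)",
    # Hypothetical sample rows, chosen so one pair shares two states.
    [
        ("Lebanon", "OH"),
        ("Springfield", "OH"),
        ("Salem", "OH"),
        ("Lebanon", "MO"),
        ("Springfield", "MO"),
    ],
)

rows = conn.execute(
    """
    SELECT city_a, city_b, COUNT(*)
    FROM (
        SELECT a.city AS city_a, a.state, b.city AS city_b
        FROM us AS a
        CROSS JOIN us AS b
        WHERE a.state = b.state
          AND a.city < b.city   -- keep each unordered pair once, drop the diagonal
    ) AS x
    GROUP BY city_a, city_b
    ORDER BY 3 DESC
    """
).fetchall()

common_states = {(city_a, city_b): n for city_a, city_b, n in rows}
for pair, n in common_states.items():
    print(pair, n)
```

Each pair appears once with the alphabetically smaller city first, so (Springfield, Lebanon) never shows up alongside (Lebanon, Springfield).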
Sanity check on the first item:
mysql> SELECT city, state FROM us WHERE city IN ('Lebanon', 'Springfield');
+-------------+-------+
| city | state |
+-------------+-------+
| Springfield | FL |
| Springfield | IL |
| Lebanon | IN |
| Springfield | MA |
| Lebanon | ME |
| Lebanon | MO |
| Springfield | MO |
| Lebanon | NH |
| Springfield | NJ |
| Lebanon | OH |
| Springfield | OH |
| Lebanon | OR |
| Springfield | OR |
| Lebanon | PA |
| Springfield | PA |
| Lebanon | TN |
| Springfield | TN |
| Springfield | VA |
| Springfield | VT |
+-------------+-------+
19 rows in set (0.00 sec)
The main Handler% status values show that it did a lot of work, but not quite O(N*N) (probably because the cross join only works on one state at a time):
| Handler_read_key | 4176 |
| Handler_read_next | 667294 |
| Handler_read_rnd | 1742 |
| Handler_read_rnd_next | 701964 |
| Handler_update | 1731 |
| Handler_write | 703693 |
Extrapolating to millions of rows, it could take days.
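A back-of-the-envelope version of that extrapolation, as a short Python sketch. The only measured input is the 2.63 seconds for 4175 rows reported above; the quadratic scaling is an assumption and a pessimistic one, since the Handler counts suggest the real growth is somewhat below O(N*N).

```python
# Measured above: 2.63 seconds for the 4175 rows in `us`.
MEASURED_ROWS = 4175
MEASURED_SECONDS = 2.63
SECONDS_PER_DAY = 86_400


def estimate_seconds(target_rows: int) -> float:
    """Scale the measured time, assuming cost grows with the square of the row count."""
    scale = target_rows / MEASURED_ROWS
    return MEASURED_SECONDS * scale ** 2


for n in (100_000, 1_000_000, 5_000_000):
    days = estimate_seconds(n) / SECONDS_PER_DAY
    print(f"{n:>9,} rows: about {days:.2f} days")
```

Under that assumption, one million rows works out to roughly 1.7 days, which is consistent with the "days" estimate.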
This was a fun challenge. Using a list of US cities, I came up with this solution (in MySQL):
SELECT city_a, city_b,
COUNT(DISTINCT state)
FROM (
( SELECT a.city city_a,
b.city city_b,
a.state -- This line differs
FROM us a
CROSS JOIN us b
WHERE a.state = b.state
AND a.city != b.city -- Added (to avoid noise)
AND a.city < 'M' -- to speed up test
AND b.city < 'M'
)
UNION ALL
( SELECT a.city city_a,
b.city city_b,
b.state -- This line differs
FROM us a
CROSS JOIN us b
WHERE a.state = b.state
AND a.city != b.city -- Added (to avoid noise)
AND a.city < 'M' -- to speed up test
AND b.city < 'M'
)
) ab
GROUP BY 1, 2
HAVING COUNT(DISTINCT state) > 1
ORDER BY COUNT(DISTINCT state) desc
INDEX(state, city) helps performance.
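For illustration, here is a sketch of adding such a composite index and confirming that it changes only the access path, not the result. It uses Python with an in-memory SQLite database rather than MySQL, so the index name idx_state_city, the made-up sample rows, and the shape of the printed plan are all assumptions; the MySQL optimizer may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE us (city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO us (city, state) VALUES (?, ?)",
    # Hypothetical sample rows, not the real `us` table.
    [
        ("Lebanon", "OH"),
        ("Springfield", "OH"),
        ("Salem", "OH"),
        ("Lebanon", "MO"),
        ("Springfield", "MO"),
    ],
)

query = """
    SELECT a.city, b.city, COUNT(*)
    FROM us AS a
    JOIN us AS b ON a.state = b.state AND a.city < b.city
    GROUP BY a.city, b.city
    ORDER BY a.city, b.city
"""

before = conn.execute(query).fetchall()

# Composite index: equality on state first, then a range on city.
conn.execute("CREATE INDEX idx_state_city ON us (state, city)")
after = conn.execute(query).fetchall()

# Show how SQLite plans the join now that the index exists.
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)

print(before == after)
```

The column order matters: state comes first because the join compares it for equality, and city second because it is only compared by range.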
Results:
+----------+------------+-----------------------+
| city_a | city_b | COUNT(DISTINCT state) |
+----------+------------+-----------------------+
| Franklin | Bedford | 4 |
| Lebanon | Franklin | 4 |
| Franklin | Lebanon | 4 |
| Hudson | Franklin | 4 |
| Columbia | Clinton | 4 |
| Clinton | Columbia | 4 |
| Franklin | Hudson | 4 |
| Bedford | Franklin | 4 |
| Lebanon | Farmington | 3 |
| Hanover | Kingston | 3 |
...
(25.17 sec)
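It would probably have taken about 4 times as long to include the whole alphabet. There are only 4K rows in the table, so this is not a fast task.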
"Proof" of the results:
mysql> SELECT city, state FROM us WHERE city IN ('Franklin', 'Bedford');
+----------+-------+
| city | state |
+----------+-------+
| Bedford | IN |
| Franklin | IN |
| Bedford | MA |
| Franklin | MA |
| Bedford | NH |
| Franklin | NH |
| Bedford | OH |
| Franklin | OH |
| Franklin | TN |
| Bedford | TX |
| Franklin | WI |
+----------+-------+
11 rows in set (0.00 sec)