Test case:
I created a somewhat realistic test case:
CREATE TEMP TABLE student_course (
student_id integer
,course_id integer
,PRIMARY KEY (student_id, course_id)
);
INSERT INTO student_course
SELECT *
FROM (VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)) v
-- to include some non-random values in test
UNION ALL
SELECT DISTINCT student_id, normal_rand((random() * 30)::int, 1000, 35)::int
FROM generate_series(4, 5000) AS student_id;
DELETE FROM student_course WHERE random() > 0.9; -- create some dead tuples
ANALYZE student_course; -- needed for temp table
Note the use of normal_rand()
to populate the dummy table with a normal distribution of values. It's shipped with the tablefunc module, and since i am going to use that further down anyway ...
Also note the bold emphasis on the numbers I am going to manipulate for the benchmark to simulate various test cases.
Plain SQL
The question is rather basic and unclear. Find the first two students with matching courses? Or find all? Find couples of them or groups of students sharing the same courses?
Craig answers to:
Find all couples sharing the same courses.
Plain SQL With a CTE and grouping by arrays, slightly formatted:
WITH student_coursearray(student_id, courses) AS (
SELECT student_id, array_agg(course_id ORDER BY course_id)
FROM student_course
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a
JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id < b.student_id
ORDER BY a.student_id, b.student_id;
The second query in Craig's answer dropped out of the race right away. With more than just a few rows, performance quickly deteriorates badly. The CROSS JOIN
is poison.
E1 - Improved version
There is one major weakness, ORDER BY
per aggregate is a bad performer, so I rewrote with ORDER BY
in a subquery:
WITH cte AS (
SELECT student_id, array_agg(course_id) AS courses
FROM (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM cte a
JOIN cte b USING (courses)
WHERE a.student_id < b.student_id
ORDER BY 1,2;
E2 - Alternative interpretation of question
I think the generally more useful case is:
Find all students sharing the same courses.
So I return arrays of students with matching courses.
WITH s AS (
SELECT student_id, array_agg(course_id) AS courses
FROM (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
GROUP BY student_id
)
SELECT array_agg(student_id)
FROM s
GROUP BY courses
HAVING count(*) > 1
ORDER BY array_agg(student_id);
F1 - Dynamic PL/pgSQL function
To make this generic and fast I wrapped it into a plpgsql function with dynamic SQL:
CREATE OR REPLACE FUNCTION f_same_set(_tbl regclass, _id text, _match_id text)
RETURNS SETOF int[] AS
$func$
BEGIN
RETURN QUERY EXECUTE format(
$f$
WITH s AS (
SELECT %1$I AS id, array_agg(%2$I) AS courses
FROM (SELECT %1$I, %2$I FROM %3$s ORDER BY 1, 2) s
GROUP BY 1
)
SELECT array_agg(id)
FROM s
GROUP BY courses
HAVING count(*) > 1
ORDER BY array_agg(id)
$f$
,_id, _match_id, _tbl
);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_same_set('student_course', 'student_id', 'course_id');
Works for any table with numeric columns. It's trivial to extend for other data types, too.
crosstab()
For a relatively small number of courses
(and arbitrarily big number of students) crosstab()
provided by the additional tablefunc module is another option in PostgreSQL. More general info here:
PostgreSQL Crosstab Query
Simple case
A simple case for the simple example in the question, much like explained in the linked answer:
SELECT array_agg(student_id)
FROM crosstab('
SELECT student_id, course_id, TRUE
FROM student_course
ORDER BY 1'
,'VALUES (1),(2),(3)'
)
AS t(student_id int, c1 bool, c2 bool, c3 bool)
GROUP BY c1, c2, c3
HAVING count(*) > 1;
F2 - Dynamic crosstab function
For the simple case, the crosstab variant was faster, so I build a plpgsql function with dynamic SQL and included it in the test. Functionally identical with F1.
CREATE OR REPLACE FUNCTION f_same_set_x(_tbl regclass, _id text, _match_id text)
RETURNS SETOF int[] AS
$func$
DECLARE
_ids int[]; -- for array of match_ids (course_id in example)
BEGIN
-- Get list of match_ids
EXECUTE format(
'SELECT array_agg(DISTINCT %1$I ORDER BY %1$I) FROM %2$s',_match_id, _tbl)
INTO _ids;
-- Main query
RETURN QUERY EXECUTE format(
$f$
SELECT array_agg(%1$I)
FROM crosstab('SELECT %1$I, %2$I, TRUE FROM %3$s ORDER BY 1'
,'VALUES (%4$s)')
AS t(student_id int, c%5$s bool)
GROUP BY c%6$s
HAVING count(*) > 1
ORDER BY array_agg(student_id)
$f$
,_id
,_match_id
,_tbl
,array_to_string(_ids, '),(') -- values
,array_to_string(_ids, ' bool,c') -- column def list
,array_to_string(_ids, ',c') -- names
);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_same_set_x('student_course', 'student_id', 'course_id');
Benchmark
I tested on my small PostgreSQL test server.
PostgreSQL 9.1.9 on Debian Linux on an ~ 6 years old AMD Opteron Server. I ran 5 test sets with the above settings and each of the presented queries. Best of 5 with EXPLAIN ANALYZE
.
I used these values for the bold numbers in the above test case to populate:
nr. of students / max. nr. of courses / standard deviation (results in more distinct course_ids)
1. 1000 / 30 / 35
2. 5000 / 30 / 50
3. 10000 / 30 / 100
4. 10000 / 10 / 10
5. 10000 / 5 / 5
C1
1. Total runtime: 57 ms
2. Total runtime: 315 ms
3. Total runtime: 663 ms
4. Total runtime: 543 ms
5. Total runtime: 2345 ms (!) - deteriorates with many pairs
E1
1. Total runtime: 46 ms
2. Total runtime: 251 ms
3. Total runtime: 529 ms
4. Total runtime: 338 ms
5. Total runtime: 734 ms
E2
1. Total runtime: 45 ms
2. Total runtime: 245 ms
3. Total runtime: 515 ms
4. Total runtime: 218 ms
5. Total runtime: 143 ms
F1 victor
1. Total runtime: 14 ms
2. Total runtime: 77 ms
3. Total runtime: 166 ms
4. Total runtime: 80 ms
5. Total runtime: 54 ms
F2
1. Total runtime: 62 ms
2. Total runtime: 336 ms
3. Total runtime: 1053 ms (!) crosstab() deteriorates with many distinct values
4. Total runtime: 195 ms
5. Total runtime: 105 ms (!) but performs well with fewer distinct values
The PL/pgSQL function with dynamic SQL, sorting rows in a subquery is clear victor.