I wrote a program to generate tests, each composed of a combination of questions drawn from a large pool. Each test had to meet a number of criteria, and the program saved a test to the database only if it satisfied them.
The program was written to ensure as even a distribution of questions as possible, i.e., when generating combinations of questions, the algorithm prioritises questions from the pool that have been asked the fewest times in previous iterations.
I created one table, tests, essentially to store the test_id for each test, and another, test_questions, to store test_ids and their corresponding question_ids, using n rows per test (where n is the number of questions in each test).
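For reference, the two tables could be sketched roughly as follows. This is only an illustration of the layout described above: the column types, the composite primary key and the foreign key are my assumptions, not a copy of the actual schema.

```sql
-- Hypothetical sketch of the two tables; types and constraints are assumed.
CREATE TABLE tests (
    test_id INT PRIMARY KEY
);

-- One row per (test, question) pair, so a test with n questions has n rows.
CREATE TABLE test_questions (
    test_id     INT NOT NULL,
    question_id INT NOT NULL,
    PRIMARY KEY (test_id, question_id),
    FOREIGN KEY (test_id) REFERENCES tests (test_id)
);
```

The composite key is there only to rule out the same question appearing twice in one test, which would otherwise inflate any overlap count based on a self-join.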
Now that I have the tests stored in a database, I’d like to check that the overlap of questions between any pair of tests is within certain bounds, and I thought I should be able to do this using SQL.
Using a self-join, I was able to use this query to count the questions common to Test 3 and Test 5:
-- Get the number of questions that are common to tests 3 and 5
SELECT count(tq1.question_id) AS Overlap
FROM test_questions AS tq1
JOIN test_questions AS tq2
ON tq1.question_id = tq2.question_id
WHERE tq1.test_id = 5
AND tq2.test_id = 3;
I was also able to generate each possible pair of tests from the first n tests (here n = 5):
-- Get all combinations of pairs of tests from 1 to 5
SELECT t1.test_id AS Test1, t2.test_id AS Test2
FROM tests AS t1
JOIN tests AS t2
ON t2.test_id > t1.test_id
WHERE t1.test_id <= 5
AND t2.test_id <= 5;
What I’d like to do, but have so far failed to do, is combine the two queries above to show each possible pair of the first 5 tests, along with the number of questions common to both tests.
-- This doesn't work
SELECT t1.test_id AS Test1, t2.test_id AS Test2, count(tq1.question_id) AS Overlap
FROM tests AS t1
JOIN tests AS t2
ON t2.test_id > t1.test_id
JOIN test_questions AS tq1
ON t1.test_id = tq1.test_id
JOIN test_questions AS tq2
ON t2.test_id = tq2.test_id
WHERE t1.test_id <= 11
AND t2.test_id <= 11
GROUP BY t1.test_id, t2.test_id;
I’ve created a simplified version (with randomised data) of the two tables at this SQL Fiddle
Note: I’m using MySQL as my DBMS but the SQL should be compatible with the ANSI standard.
Edit: The program I wrote to generate the tests actually generated more tests than I needed, and I only want to compare the first n tests. In the example, I added a <= 5 WHERE condition to ignore the extra tests.
To clarify what I’m looking for, using Thorsten Kettner’s example data:
test 1: a, b and c
test 2: a, b and d
test 3: d, e and f
The results would be:
Test   Test   Overlap
Test1  Test2  2 (a and b in common)
Test1  Test3  0 (no questions in common)
Test2  Test3  1 (d is common to both)
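In case it helps anyone reproduce this, the example data could be loaded roughly like this. It assumes the tests and test_questions tables described earlier already exist and are empty, and it maps questions a to f onto the ids 1 to 6, a mapping I made up purely for illustration.

```sql
-- Example data only; the letter-to-id mapping (a=1 ... f=6) is an assumption.
INSERT INTO tests (test_id) VALUES (1), (2), (3);

INSERT INTO test_questions (test_id, question_id) VALUES
    (1, 1), (1, 2), (1, 3),   -- test 1: a, b, c
    (2, 1), (2, 2), (2, 4),   -- test 2: a, b, d
    (3, 4), (3, 5), (3, 6);   -- test 3: d, e, f

-- The single-pair query from above, pointed at tests 1 and 2.
SELECT count(tq1.question_id) AS Overlap
FROM test_questions AS tq1
JOIN test_questions AS tq2
  ON tq1.question_id = tq2.question_id
WHERE tq1.test_id = 1
  AND tq2.test_id = 2;
-- Overlap = 2 (question ids 1 and 2, i.e. a and b)
```

With this data the single-pair query gives the first row of the expected results. What I am missing is a way to get all three rows from one query.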