Question

I wrote a program to generate tests, each composed of a combination of questions taken from a large pool. Each test had to meet a number of criteria, and the program saved a test to the database only if it satisfied them.

My program was written to ensure as even a distribution of questions as possible, i.e., when generating combinations of questions, the algorithm prioritises questions from the pool that have been asked the fewest times in previous iterations.

I created one table, tests, to store the test_id for each test, and another, test_questions, to store test_ids and their corresponding question_ids using n rows per test (where n is the number of questions in each test).
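
For readers who want to reproduce the setup, here is a minimal sketch of the two tables as described above. The column types, the keys, and the sample question IDs are assumptions (the original schema is only on the fiddle), and an in-memory SQLite database driven from Python stands in for MySQL so the snippet is self-contained.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL one (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per test.
    CREATE TABLE tests (
        test_id INTEGER PRIMARY KEY
    );

    -- n rows per test: one row for each question the test contains.
    CREATE TABLE test_questions (
        test_id     INTEGER NOT NULL REFERENCES tests (test_id),
        question_id INTEGER NOT NULL,
        PRIMARY KEY (test_id, question_id)
    );
""")

# Hypothetical sample: test 1 contains questions 10, 11 and 12.
conn.execute("INSERT INTO tests (test_id) VALUES (1)")
conn.executemany(
    "INSERT INTO test_questions (test_id, question_id) VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12)],
)

# A test with n questions occupies n rows in test_questions.
rows_per_test = conn.execute(
    "SELECT test_id, COUNT(*) FROM test_questions GROUP BY test_id"
).fetchall()
print(rows_per_test)  # [(1, 3)]
```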

Now that I have the tests stored in a database, I’d like to check that the overlap of questions between different pairs of tests is within certain bounds, and I thought I should be able to do this using SQL.

Using a self-join, I was able to use this query to select the questions common to Test 3 and Test 5:

-- Get the number of questions that are common to tests 3 and 5
SELECT count(tq1.question_id) AS Overlap
FROM test_questions AS tq1
JOIN test_questions AS tq2
ON tq1.question_id = tq2.question_id
WHERE tq1.test_id = 5
AND tq2.test_id = 3;

I was able to generate each possible combination of test pairs from the first n (5) tests:

-- Get all combinations of pairs of tests from 1 to 5
SELECT t1.test_id AS Test1, t2.test_id AS Test2
FROM tests AS t1
JOIN tests AS t2
ON t2.test_id > t1.test_id
WHERE t1.test_id <= 5
AND t2.test_id <= 5;

What I’d like to do but so far have failed to do is to combine the above two queries to show each possible pair combination of the first 5 tests – along with the number of questions that are common to both tests.

-- This doesn't work
SELECT t1.test_id AS Test1, t2.test_id AS Test2, count(tq1.question_id) AS Overlap
FROM tests AS t1
JOIN tests AS t2
ON t2.test_id > t1.test_id
JOIN test_questions AS tq1
ON t1.test_id = tq1.test_id
JOIN test_questions AS tq2
ON t2.test_id = tq2.test_id
WHERE t1.test_id <= 11
AND t2.test_id <= 11
GROUP BY t1.test_id, t2.test_id;

I’ve created a simplified version (with randomised data) of the two tables at this SQL Fiddle

Note: I’m using MySQL as my DBMS but the SQL should be compatible with the ANSI standard.

Edit: The program I wrote to generate the tests actually generated more than the number of tests I needed and I only want to compare the first n tests. In the example, I added a <= 5 WHERE condition to ignore the extra tests.

To clarify what I’m looking for as per Thorsten Kettner’s example data:

test 1: a, b and c
test 2: a, b and d
test 3: d, e and f

The results would be:

Test    Test    Overlap
Test1   Test2   2       (a and b in common)
Test1   Test3   0       (no questions in common)
Test2   Test3   1       (d is common to both)

Solution

You basically just need to add a GROUP BY to your first query. I also added another condition, so that each pair is listed once, with the lower test id first:

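-- Inner join: a LEFT JOIN here would also return rows with a NULL test_id2
-- for questions that have no match in a later test
SELECT tq1.test_id as test_id1, tq2.test_id as test_id2, count(tq1.question_id) AS Overlap
FROM test_questions tq1 JOIN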
     test_questions tq2
     ON tq1.question_id = tq2.question_id and
        tq1.test_id < tq2.test_id
GROUP BY tq1.test_id, tq2.test_id;

This is standard SQL.
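
As a quick check, the grouped self-join can be run against the three-test example from the question. This is only a sketch: it uses an inner join, SQLite (through Python's sqlite3 module) stands in for MySQL, the question IDs are stored as the letters a to f, and the ORDER BY is added just to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_questions (test_id INTEGER, question_id TEXT)")
# Example data from the question: test 1 = a, b, c; test 2 = a, b, d; test 3 = d, e, f.
conn.executemany(
    "INSERT INTO test_questions VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"),
     (2, "a"), (2, "b"), (2, "d"),
     (3, "d"), (3, "e"), (3, "f")],
)

# Self-join on question_id; test_id1 < test_id2 keeps each pair once.
overlap_sql = """
    SELECT tq1.test_id AS test_id1, tq2.test_id AS test_id2,
           COUNT(tq1.question_id) AS Overlap
    FROM test_questions tq1
    JOIN test_questions tq2
      ON tq1.question_id = tq2.question_id
     AND tq1.test_id < tq2.test_id
    GROUP BY tq1.test_id, tq2.test_id
    ORDER BY tq1.test_id, tq2.test_id
"""
overlaps = conn.execute(overlap_sql).fetchall()
print(overlaps)  # [(1, 2, 2), (2, 3, 1)]
```

Note that the pair (1, 3) is missing from the result: an inner self-join can only produce pairs that share at least one question.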

If you want to get all pairs of tests, even those that have no questions in common, here is another approach:

SELECT t1.test_id as test_id1, t2.test_id as test_id2, count(tq2.question_id) AS Overlap
FROM tests t1 CROSS JOIN
     tests t2 LEFT JOIN
     test_questions tq1
     on t1.test_id = tq1.test_id LEFT JOIN
     test_questions tq2
     ON t2.test_id = tq2.test_id and tq1.question_id = tq2.question_id 
GROUP BY t1.test_id, t2.test_id;

This assumes that you have a table with one row per test. If not, replace tests with (select distinct test_id from test_questions).
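
Here is a sketch of that substitution on the example data from the question, again using SQLite through Python's sqlite3 module as a stand-in for MySQL. One deliberate difference from the query above, which is my assumption about what is wanted: the pairs are built with t1.test_id < t2.test_id rather than a CROSS JOIN, so each pair appears once and no test is paired with itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_questions (test_id INTEGER, question_id TEXT)")
# Example data from the question: test 1 = a, b, c; test 2 = a, b, d; test 3 = d, e, f.
conn.executemany(
    "INSERT INTO test_questions VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"),
     (2, "a"), (2, "b"), (2, "d"),
     (3, "d"), (3, "e"), (3, "f")],
)

# The list of tests is derived from test_questions itself.
# COUNT(tq2.question_id) ignores the NULLs produced by the outer join,
# so pairs without common questions get an overlap of 0.
all_pairs_sql = """
    SELECT t1.test_id AS test_id1, t2.test_id AS test_id2,
           COUNT(tq2.question_id) AS Overlap
    FROM (SELECT DISTINCT test_id FROM test_questions) t1
    JOIN (SELECT DISTINCT test_id FROM test_questions) t2
      ON t1.test_id < t2.test_id
    LEFT JOIN test_questions tq1
      ON tq1.test_id = t1.test_id
    LEFT JOIN test_questions tq2
      ON tq2.test_id = t2.test_id
     AND tq2.question_id = tq1.question_id
    GROUP BY t1.test_id, t2.test_id
    ORDER BY t1.test_id, t2.test_id
"""
all_pairs = conn.execute(all_pairs_sql).fetchall()
print(all_pairs)  # [(1, 2, 2), (1, 3, 0), (2, 3, 1)]
```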

OTHER TIPS

I modified Gordon's answer and this query provides a listing of test combinations along with their corresponding overlap (questions in common):

SELECT tq1.test_id as test_id1, tq2.test_id as test_id2, count(tq1.question_id) AS Overlap
FROM test_questions tq1
JOIN test_questions tq2
ON tq1.question_id = tq2.question_id
AND tq1.test_id < tq2.test_id 
WHERE tq1.test_id <= 5
AND tq2.test_id <= 5
GROUP BY tq1.test_id, tq2.test_id;

  • First step: Find all test combinations, for instance: 1-2, 1-3, 2-3.
  • Second step: Join all questions of the first test.
  • Third step: Outer join the matching question of the second test, if it exists.
  • Last step: Count the matching questions found per test combination.

    select test_combinations.t1_test_id, test_combinations.t2_test_id, count(q2.question_id)
    from
    (
        select t1.test_id as t1_test_id, t2.test_id as t2_test_id
        from (select test_id from tests where test_id <= 5) t1
        inner join (select test_id from tests where test_id <= 5) t2 on t2.test_id > t1.test_id
    ) test_combinations
    inner join test_questions q1 on q1.test_id = test_combinations.t1_test_id
    left join test_questions q2 on q2.test_id = test_combinations.t2_test_id and q2.question_id = q1.question_id
    group by test_combinations.t1_test_id, test_combinations.t2_test_id
    order by test_combinations.t1_test_id, test_combinations.t2_test_id;

I've added a test with no overlapping questions to your fiddle and removed the restriction to test_id <= 5, so you see pairs of tests with zero overlapping questions: http://sqlfiddle.com/#!2/e83aa/1

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow