Here is one method, that is approximate and computationally painful. It starts by assigning each module a number of students. Then it chooses students for that module randomly.
insert into student_takes_module(module_id, student_id)
select m.module_id, s.student_id
from (select m.*, 10 + rand() * 350 as numstudents
from modules m
) m cross join
students s cross join
(select count(*) as totalstudents) const
where rand() < m.numstudents/const.totalstudents;
The 350
instead of 400
is because the use of rand()
in this context is approximate. The use of 10
is because I think that if you have a minimum of 10 students, then you will probably get at least one student for that class as you cycle through the data.
This approach will be processing 10,000*500 = 5,000,000 rows to generate the test data. However, the calculations are not so bad (rand()
has a reputation for bad performance but that is because of confusion between the function call and order by rand()
). You can test the performance by putting limit
at the end to see how long it takes it to generate 10 rows, then 1000 rows, then 10000, then all the rows you need.