Generating large amounts of test data using sets of known values?

Question 1

Here is one method that is approximate and computationally painful. It starts by assigning each module a target number of students, then chooses the students for that module at random.

insert into student_takes_module(module_id, student_id)
    select m.module_id, s.student_id
    from (select m.*, 10 + rand() * 350 as numstudents  -- target size per module, 10 to 360
          from modules m
         ) m cross join
         students s cross join
         (select count(*) as totalstudents from students) const  -- total number of students
    where rand() < m.numstudents / const.totalstudents;  -- keep each pair with this probability

The multiplier is 350 rather than 400 because selecting rows with rand() is only approximate, so the actual count for a module can land somewhat above its target. The floor of 10 is there because I think that, with a target of at least 10 students, you will probably get at least one student for that module as you cycle through the data.
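
To see how close the approximation comes, a query along these lines could be run after the insert. It is only a sketch: it assumes student_takes_module holds nothing but the generated rows, and the aliases cnt, smallest, largest, average, and counts are made up for illustration.

select min(cnt) as smallest, max(cnt) as largest, avg(cnt) as average
    from (select module_id, count(*) as cnt  -- students actually assigned to each module
          from student_takes_module
          group by module_id
         ) counts;

If largest comes out above 400, lowering the 350 multiplier should pull it back down. Modules that received no students at all do not appear in student_takes_module, so this query will not report them.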

This approach processes 10,000 * 500 = 5,000,000 rows to generate the test data. However, the calculations are not so bad: rand() has a reputation for bad performance, but that comes from confusing the function call itself with order by rand(). You can test the performance by putting a limit clause at the end of the select and timing how long it takes to generate 10 rows, then 1,000 rows, then 10,000, then all the rows you need.
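
As a rough sketch of that test, the select can be run on its own, without the insert, so that nothing is written while timing it. Only the limit value changes between runs. This assumes MySQL, where rand() and limit behave as used in the query above.

select m.module_id, s.student_id
    from (select m.*, 10 + rand() * 350 as numstudents
          from modules m
         ) m cross join
         students s cross join
         (select count(*) as totalstudents from students) const
    where rand() < m.numstudents / const.totalstudents
    limit 10;  -- rerun with 1000, then 10000, then without the limit

If the time grows roughly in step with the limit, the full run can be estimated before committing to it.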

Question 2

Another solution is to use the RANDBETWEEN function in Excel to randomly pick a module_ID for each row, filling a new column alongside the student_IDs.

Alternatively, Mockaroo will accept all 500 module_IDs as a custom list, which it can then assign to randomly generated student_IDs.

The downside to these methods is that the students will be spread very evenly among all modules.
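
For comparison, the same even spread is easy to reproduce in SQL. This sketch mirrors what RANDBETWEEN(1, 500) does in Excel. It assumes the module_id values are the contiguous integers 1 to 500, which may not hold for the real modules table.

select s.student_id,
       floor(1 + rand() * 500) as module_id  -- uniform pick from 1 to 500 (assumed id range)
    from students s;

Because every module is equally likely on every row, each one ends up with roughly the same number of students, which is the downside described above.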