It's worth noting that those benchmarks serve no real purpose. They were specifically designed to do things that are inefficient in rules engines. They have very little value even for comparing engines, because you're unlikely ever to write a real-world application that looks anything like Miss Manners.
If you just want large amounts of data for your tests, there is plenty of open data out there. For instance, the UK government publishes a variety of open data sets; you can pick one that suits your experiment here:
http://data.gov.uk/data/search
Or you could grab a load of gene sequence data from GenBank:
http://www.ncbi.nlm.nih.gov/genbank/
There's loads of free data out there for which you could write rules.
If you are really looking to benchmark rules engines, then it would probably be better to generate the data yourself. That way you control the statistical distributions, and you can reproduce the same variation in your test data from run to run.
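As a rough sketch of that approach, here is one way to generate reproducible synthetic facts with controlled distributions. The field names (`region`, `amount`, `priority`) and the distributions chosen are purely illustrative assumptions, not taken from any particular benchmark; the point is the fixed seed and explicit distributions, so every engine under test sees identical data:

```python
import random

def generate_orders(n, seed=42):
    """Generate n synthetic 'order' facts. A fixed seed makes the
    data set reproducible, so every benchmark run (and every engine)
    works against exactly the same facts."""
    rng = random.Random(seed)
    regions = ["north", "south", "east", "west"]
    orders = []
    for i in range(n):
        orders.append({
            "id": i,
            "region": rng.choice(regions),                 # uniform categorical
            "amount": round(rng.lognormvariate(3, 1), 2),  # skewed, always > 0
            "priority": rng.randint(1, 5),                 # uniform 1..5
        })
    return orders

orders = generate_orders(10_000)
```

You would then insert these facts into whichever engine you are testing. Because the seed is fixed, rerunning the generator yields the same facts, which is what makes timing comparisons between runs meaningful.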
However, all you will really be benchmarking is a specific set of rules. Any such benchmark becomes obsolete as soon as the rules change.