One excellent dataset is the one provided by this very website. Stack Exchange publishes an anonymized dump of all publicly available data from its sites here: https://archive.org/details/stackexchange
You can read about the data schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
My copy of the data is about a year old, and it has over 16 million records for this site alone (StackOverflow.com). The dump also covers all of their other sites.
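
With that many records you probably don't want to load a whole file into memory at once. Here is a minimal sketch of streaming through one file of the dump, assuming it is XML with one `<row>` element per record and the columns stored as attributes (check the schema link above for the real file and column names). The function name and the default `PostTypeId` attribute are my own choices for illustration, not something the dump prescribes.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_rows_by_attribute(source, attribute="PostTypeId"):
    """Stream a dump-style XML file and tally <row> elements by one attribute.

    `source` may be a file path or a binary file-like object.
    Assumption: each record is a <row .../> element with columns as attributes.
    """
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it
    # so already-processed rows can be dropped from memory as we go.
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            counts[elem.get(attribute, "<missing>")] += 1
            root.clear()  # discard rows we've already counted
    return counts
```

Calling something like `count_rows_by_attribute("Posts.xml")` (the file name is an assumption here) would give you a count per value of that attribute without ever holding the full file in memory.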