You are, of course, summarizing a huge number of records, then randomizing them. This kind of thing is hard to make fast. Going back to the beginning of time makes it worse. Searching on a null condition just trashes it.
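If you want this to perform reasonably, you must get rid of the IS NOT NULL
selection. You lose nothing by dropping it: the originalTweetId <> -1 comparison is never true for a NULL value, so those rows are excluded anyway.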
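Dropping that condition does not change the result, because a NULL compared with <> never comes out true. A quick check you can run in any MySQL session (the column aliases are made up for illustration):

```sql
-- Comparing NULL with <> yields NULL, and WHERE only keeps rows whose condition is true.
SELECT NULL <> -1           AS null_vs_minus_one,    -- NULL
       (NULL <> -1) IS NULL AS comparison_is_null,   -- 1
       5 <> -1              AS five_vs_minus_one;    -- 1
```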
But let us try to find a reasonable solution. First, let's get the originalTweetId
values we need.
    SELECT MIN(id) originalId,
           MIN(tweetDate) tweetDate,
           originalTweetId,
           COUNT(*) AS total
      FROM twitter_gokhan2.tweetentities
     WHERE originalTweetId <> -1
       /*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
       AND isRetweet = true
       AND tweetDate < CURDATE() - INTERVAL 4 DAY
       AND tweetDate > CURDATE() - INTERVAL 30 DAY /*let's add this, if we can*/
     GROUP BY originalTweetId
    HAVING total >= 50
This summary query gives us the lowest id number and date in your database for each subject tweet.
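To get this to run fast, we need a compound index on (originalTweetId, isRetweet, tweetDate, id). That index covers the query: every column the query touches is in it, so MySQL can answer from the index alone, reading the originalTweetId groups in index order and checking isRetweet and tweetDate without visiting the table rows. That is about as fast as you can hope for. Debug this query, both for correctness and performance, then move on.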
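As a sketch, creating that index and checking whether the optimizer picks it up might look like this. The index name idx_retweet_summary is made up for illustration; the column list is the one given above.

```sql
-- Hypothetical index name; columns in the order suggested above.
CREATE INDEX idx_retweet_summary
    ON twitter_gokhan2.tweetentities (originalTweetId, isRetweet, tweetDate, id);

-- Ask MySQL how it plans to run the summary query.
EXPLAIN
SELECT MIN(id) AS originalId,
       MIN(tweetDate) AS tweetDate,
       originalTweetId,
       COUNT(*) AS total
  FROM twitter_gokhan2.tweetentities
 WHERE originalTweetId <> -1
   AND isRetweet = true
   AND tweetDate < CURDATE() - INTERVAL 4 DAY
   AND tweetDate > CURDATE() - INTERVAL 30 DAY
 GROUP BY originalTweetId
HAVING total >= 50;
```

In the EXPLAIN output, look at the key column to see which index was chosen, and at the Extra column: "Using index" there means the query is being answered from the index without reading the table rows.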
Now do the randomization. Let's do this with the minimum amount of data we can, to avoid sorting some enormous amount of stuff.
    SELECT originalId, originalTweetId, tweetDate, total, RAND() AS randomOrder
      FROM (
            SELECT MIN(id) originalId,
                   MIN(tweetDate) tweetDate,
                   originalTweetId,
                   COUNT(*) AS total
              FROM twitter_gokhan2.tweetentities
             WHERE originalTweetId <> -1
               /*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
               AND isRetweet = true
               AND tweetDate < CURDATE() - INTERVAL 4 DAY
               AND tweetDate > CURDATE() - INTERVAL 30 DAY /*let's add this, if we can*/
             GROUP BY originalTweetId
            HAVING total >= 50
           ) AS retweets
     ORDER BY randomOrder
     LIMIT 1200
Great. Now we have a list of up to 1200 tweet ids and dates in random order. Now let's go get the content.
    SELECT a.originalTweetId, a.total, b.tweetContent, a.tweetDate
      FROM (
            /* that whole query above */
           ) AS a
      JOIN twitter_gokhan2.tweetentities AS b ON (a.originalId = b.id)
     ORDER BY a.randomOrder
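Spelled out in full, with the placeholder expanded, the final query might look like the sketch below. It assumes the randomized subquery carries originalId (the MIN(id) from the summary) through to the outer level, so that the join has something to match against b.id.

```sql
SELECT a.originalTweetId, a.total, b.tweetContent, a.tweetDate
  FROM (
        -- Step 2: attach a random sort key and keep 1200 rows at most.
        SELECT originalId, originalTweetId, tweetDate, total, RAND() AS randomOrder
          FROM (
                -- Step 1: summarize retweets per original tweet.
                SELECT MIN(id) AS originalId,
                       MIN(tweetDate) AS tweetDate,
                       originalTweetId,
                       COUNT(*) AS total
                  FROM twitter_gokhan2.tweetentities
                 WHERE originalTweetId <> -1
                   AND isRetweet = true
                   AND tweetDate < CURDATE() - INTERVAL 4 DAY
                   AND tweetDate > CURDATE() - INTERVAL 30 DAY
                 GROUP BY originalTweetId
                HAVING total >= 50
               ) AS retweets
         ORDER BY randomOrder
         LIMIT 1200
       ) AS a
  -- Step 3: fetch the wide tweetContent column only for the surviving rows.
  JOIN twitter_gokhan2.tweetentities AS b ON a.originalId = b.id
 ORDER BY a.randomOrder;
```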
See how this goes? Use a compound index to do your summary, and do it on the minimum amount of data. Then do the randomizing, then go fetch the extra data you need.