query optimization to find random sample

Question 1

You are, of course, summarizing a huge number of records, then randomizing them. This kind of thing is hard to make fast. Going back to the beginning of time makes it worse. Searching on a null condition just trashes it.

If you want this to perform reasonably, you must get rid of the IS NOT NULL selection. Otherwise, it will perform badly.

But let us try to find a reasonable solution. First, let's get the originalTweetId values we need.

 SELECT MIN(id) originalId,
        MIN(tweetDate) tweetDate,
        originalTweetId, 
        Count(*) as total
   FROM twitter_gokhan2.tweetentities
  WHERE originalTweetId <> -1 
  /*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
    AND isRetweet = true
    AND tweetDate < CURDATE() - INTERVAL 4 DAY
    AND tweetDate > CURDATE() - INTERVAL 30 DAY  /*let's add this, if we can*/
  GROUP BY originalTweetId
 HAVING total >= 50

This summary query gives us the lowest id number and date in your database for each subject tweet.

To get this to run fast, we need a compound index on (originalTweetId, isRetweet, tweetDate, id). The query will do a range scan of this index on tweetDate, which is about as fast as you can hope for. Debug this query, both for correctness and performance, then move on.

Now do the randomization. Let's do this with the minimum amount of data we can, to avoid sorting some enormous amount of stuff.

 SELECT originalTweetId, tweetDate, total, RAND() AS randomOrder
   FROM (
    SELECT MIN(id) originalId,
           MIN(tweetDate) tweetDate
           originalTweetId, 
           Count(*) as total
      FROM twitter_gokhan2.tweetentities
     WHERE originalTweetId <> -1 
     /*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
       AND isRetweet = true
       AND tweetDate < CURDATE() - INTERVAL 4 DAY
       AND tweetDate > CURDATE() - INTERVAL 30 DAY  /*let's add this, if we can*/
      GROUP BY originalTweetId
     HAVING total >= 50
   ) AS retweets
  ORDER BY randomOrder
  LIMIT 1200

Great. Now we have a list of 1200 tweet ids and dates in random order. Now let's go get the content.

 SELECT a.originalTweetId, a.total, b.tweetContent, a.tweetDate
   FROM (
          /* that whole query above */
        ) AS a
   JOIN twitter_gokhan2.tweetentities AS b ON (a.id = b.id)
  ORDER BY a.randomOrder

See how this goes? Use a compound index to do your summary, and do it on the minimum amount of data. Then do the randomizing, then go fetch the extra data you need.

Question 2

You're selecting a huge number of records by selecting every record older than 4 days old....

Since the query takes a huge amount of time, why not simply prepare the results using an independant script which runs repeatedly in the background....

You might be able to make the assumption that if its a retweet, the originalTweetId cannot be null/-1

Question 3

Just to clarify... did you really mean to query everything OLDER than 4 days???

 AND (tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY))

OR... Did you mean you only wanted to aggregate RECENT TWEETS WITHIN the last 4 days. To me, tweets that happened 2 years ago would be worthless to current events... If thats the case, you might be better to just change to

 AND (tweetDate >= DATE_ADD(CURDATE(), INTERVAL - 4 DAY))

Question 4

See if this isn't a bit faster than 40 minutes:

Test first without the commented lines, then re-add them to compare performance impact. (especially ORDER BY RAND() is known to be horrible)

SELECT
    originalTweetId,
    total,
--    tweetContent,  -- may slow things somewhat
    tweetDate
FROM (
  SELECT
    originalTweetId,
    COUNT(*) AS total,
--    tweetContent,  -- may slow things somewhat
    MIN(tweetDate) AS tweetDate,
    MAX(isRetweet) AS isRetweet
  FROM twitter_gokhan2.tweetentities
  GROUP BY originalTweetId
) AS t
WHERE originalTweetId > 0
  AND isRetweet
  AND tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY)
  AND total > 50
-- ORDER BY RAND()  -- very likely to slow performance,
                    -- test with and without...
LIMIT 0, 1200;

PS - originalTweetId should be indexed hopefully