Question

Reddit has different buckets for Top posts: "This Hour", "Today", "This Week", "This Month", "This Year", and "All Time". The best way I can think of to create these lists would be to save each vote with a timestamp, so that you can calculate the score of a post for each bucket. This would be an expensive query, but they could get away with it: Top is the same for all users and doesn't change very much, so they could cache the query results.

This is just my best guess of what's going on but I'm curious, is this what Reddit is actually doing or is there a better way?


Solution

First off, "this hour", "today", "this week", etc. all refer to when the submission (link/comment) was created, not when the votes happened. I'll focus on links here, but comments are similarly processed for display on user pages.

Short answer: a bunch of cron jobs pull the links created in the relevant time period, sort them and group them by subreddit, then store cached lists of links for quick lookup.

To elaborate, there's a different cron job for each time period. The "top this hour" job runs much more frequently than the "top this year" job, for example. The first thing each job does is pull down a list of all links from the database that were created in the time period of interest. This gets dumped out to a text file, where a primitive map-reduce system processes the data: the links are grouped and sorted. The final list of results is then put into Cassandra as a simple list of link IDs, which is very quick to look up in-request.
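To make the idea concrete, here's a rough in-memory sketch of what one of those jobs does. This is not Reddit's actual code: the `Link` class, `PERIODS` table, `compute_top_listing` function and the `limit` parameter are all names I made up for illustration, and a plain dict stands in for the cached lists that would live in Cassandra. "All time" is left out since it has no cutoff.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for a submission row pulled from the database."""
    id: str
    subreddit: str
    score: int
    created: datetime


# Assumed window lengths; the real jobs may define these differently.
PERIODS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def compute_top_listing(links, period, now, limit=1000):
    """Return {subreddit: [link ids, highest score first]} for one period.

    Links are selected by when they were *created*, not by when
    their votes were cast.
    """
    cutoff = now - PERIODS[period]

    # "Map" step: keep links created inside the window, keyed by subreddit.
    by_subreddit = defaultdict(list)
    for link in links:
        if link.created >= cutoff:
            by_subreddit[link.subreddit].append(link)

    # "Reduce" step: sort each group by score and keep only the IDs.
    return {
        subreddit: [
            link.id
            for link in sorted(group, key=lambda l: l.score, reverse=True)[:limit]
        ]
        for subreddit, group in by_subreddit.items()
    }
```

A scheduled job would call something like `compute_top_listing(all_links, "hour", datetime.utcnow())` and write each resulting list to the cache, so a request for a subreddit's Top page only has to fetch a precomputed list of IDs.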

Source: https://github.com/reddit/reddit/blob/master/scripts/compute_time_listings

FWIW, individual votes do have timestamps attached to them, but they're not directly used for tracking Top.

Licensed under: CC-BY-SA with attribution