Question

How do you design a website that allows users to query a large amount of user data, more specifically:

  • there are ~100 million users with ~100 TB of data; the data is stored in HDFS (not in a database)
  • the number of (concurrent) queries is not important, but each query should be as fast as possible
  • support some simple queries, such as: get user info by id, get accumulated data like monthly logins and monthly online time
  • query results are small (a single number, or a few hundred rows), so frontend performance doesn't matter

I'm more interested in the thought process on how to approach this requirement. For example:

  • at 100 users, what is the design?
  • at 1,000,000 users, what needs to be changed?
  • at 100,000,000 users, what is the design now?

I've searched around and see a lot of people talking about caching, load balancing, and so on. Of course, those techniques are useful and can be applied, but how do you know they will help you handle N users? Nobody seems to explain this point.

Solution

It's fairly basic math.

The bottleneck is unlikely to be your database; it is far more likely to be bandwidth.

Take your maximum bandwidth, divide it by the expected number of users, and subtract roughly 15% for overhead.

If you really have unlimited bandwidth, then do the same calculation using your database throughput.
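
A back-of-the-envelope sketch of that calculation, with made-up numbers (none of these figures come from the question):

    # Rough capacity check: how much bandwidth does each user actually get?
    # All numbers below are illustrative assumptions, not measurements.
    max_bandwidth_mbps = 10_000      # assume a 10 Gbit/s pipe
    expected_users = 5_000           # assume 5,000 concurrent users
    overhead = 0.15                  # ~15% lost to protocol/infra overhead

    per_user_mbps = (max_bandwidth_mbps / expected_users) * (1 - overhead)
    print(f"Usable bandwidth per user: {per_user_mbps:.2f} Mbit/s")

    # If this figure is smaller than what a single query needs in order to
    # return its result in acceptable time, bandwidth (not the datastore)
    # is your limit. If bandwidth is not the constraint, repeat the same
    # division using the database's query throughput instead.

The same division is how you answer "how do I know it helps at N users": plug in N and check whether the per-user share of the resource is still enough.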

Other tips

Given where cloud technology is today, I would use what others have already designed to handle loads like this. Even though you have a fair amount of data, I would put it, along with future records, into something akin to Google's BigQuery:

  • Easy to query via SQL,
  • Pay per query,
  • Handles many, many pebibytes,
  • Easily embedded into web/mobile apps,
  • Already maintains a cache.
By design, non-cached queries carry some start-up latency, but I would run away fast from trying to design, scale, script, pay for, and maintain all of the above yourself.
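
To make that concrete, here is a minimal sketch using the official BigQuery Python client (google-cloud-bigquery). The project, dataset, table, and column names are hypothetical placeholders, not anything from the question:

    # Monthly logins for one user, aggregated server-side so the frontend
    # only ever receives a handful of rows. Names are illustrative only.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-analytics-project")

    query = """
        SELECT FORMAT_TIMESTAMP('%Y-%m', login_time) AS month,
               COUNT(*) AS logins
        FROM `my-analytics-project.user_events.logins`
        WHERE user_id = @user_id
        GROUP BY month
        ORDER BY month
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("user_id", "INT64", 12345)
        ]
    )

    for row in client.query(query, job_config=job_config).result():
        print(row.month, row.logins)

Getting the existing HDFS data in would still need a one-off export into files BigQuery can ingest, but after that the querying, caching, and scaling concerns above become the provider's problem rather than yours.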

Licensed under: CC-BY-SA with attribution