Designing a big data web app
https://softwareengineering.stackexchange.com/questions/387335
21-02-2021
Question
How do you design a website that allows users to query a large amount of user data, more specifically:
- there are ~100 million users with ~100TB of data, data is stored in HDFS (not a database)
- number of (concurrent) queries is not important, but each query should be as fast as possible
- support some simple queries such as: get user info by id, get accumulated data like monthly logins and monthly online time
- query results are small (a single number, or a few hundred rows), so frontend performance doesn't matter
I'm more interested in the thought process on how to approach this requirement. For example:
- at 100 users, what is the design?
- at 1,000,000 users, what needs to be changed?
- at 100,000,000 users, what is the design now?
I've searched around and see a lot of people talking about caching, load balancing, and so on. Of course, those techniques are useful and can be applied, but how do you know they will actually handle N users? Nobody seems to explain this point.
Solution
It's fairly basic math.
The bottleneck is unlikely to be your database; it's bandwidth.
Take your maximum bandwidth, subtract ~15% for protocol overhead, and divide by the expected number of concurrent users.
If you really do have unlimited bandwidth, then do the same calculation using your database throughput instead.
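The calculation above can be sketched as a quick back-of-envelope check. The concrete numbers below (a 10 Gbit/s uplink, 1,000 concurrent users) are illustrative assumptions, not figures from the question:

```python
# Back-of-envelope capacity check: how much usable bandwidth does
# each concurrent user get after accounting for overhead?

def per_user_throughput(max_bandwidth_mbps: float,
                        concurrent_users: int,
                        overhead: float = 0.15) -> float:
    """Usable bandwidth per concurrent user, in Mbit/s.

    Note: subtracting the overhead before or after dividing gives
    the same result, since both operations are linear.
    """
    usable = max_bandwidth_mbps * (1.0 - overhead)
    return usable / concurrent_users

# Illustrative: 10 Gbit/s uplink shared by 1,000 concurrent users.
per_user = per_user_throughput(10_000, 1_000)
print(f"{per_user:.1f} Mbit/s per user")  # 8.5 Mbit/s per user
```

If the per-user figure that comes out is smaller than a typical query result divided by your latency budget, bandwidth (not the database) is the limit.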
Other tips
Given where cloud tech is today, I would use what others have already designed to handle data at this scale. Since you have quite a bit of data, I would put it, along with future records, into something akin to Google's BigQuery:
- Easy to query via SQL,
- Pay per query,
- Handles many, many pebibytes,
- Easily embedded into a web/mobile app,
- Maintains a cache already.
By design, queries that miss the cache incur some startup latency, but I would run away fast from trying to design, scale, script, pay for, and maintain all of the above myself.
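As a sketch of how the question's two query shapes would look against BigQuery: the dataset, table, and column names below are assumptions for illustration, and the glue code uses the official `google-cloud-bigquery` Python client (it needs credentials and a project to actually run):

```python
# Two query shapes from the question, against a hypothetical
# dataset `analytics` (table/column names are assumptions).

USER_BY_ID = """
SELECT user_id, name, signup_date
FROM analytics.users
WHERE user_id = @user_id
"""

MONTHLY_STATS = """
SELECT DATE_TRUNC(login_date, MONTH) AS month,
       COUNT(*) AS logins,
       SUM(session_seconds) / 3600 AS online_hours
FROM analytics.logins
WHERE user_id = @user_id
GROUP BY month
ORDER BY month
"""

def run(sql: str, user_id: int):
    """Run a parameterized query with the official client.

    Requires `pip install google-cloud-bigquery` plus GCP
    credentials; shown only to illustrate how little glue code
    sits between a web backend and the warehouse.
    """
    from google.cloud import bigquery
    client = bigquery.Client()
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter("user_id", "INT64", user_id),
        ]),
    )
    return list(job.result())
```

The web app then reduces to a thin API layer that forwards a user id into one of these parameterized queries and returns the small result set.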