I have 100 trillion elements, each with a size from 1 byte to 1 trillion bytes (0.909 TiB). How to store and access them very efficiently?

StackOverflow https://stackoverflow.com/questions/8453670

Question

This is an interview question:

Suppose I have 100 trillion elements, each with a size from 1 byte to 1 trillion bytes (0.909 TiB). How can I store them and access them very efficiently?

My thoughts: they want to test knowledge of handling large volumes of data efficiently. It is not a question with only one correct answer.

Save them in some special data structure?

Actually, I have no idea how to approach this kind of open-ended question.

Any help is really appreciated.


Solution

It really depends on the data-set in question. I think the point is for you to discuss the alternatives and describe the various pros/cons.

Perhaps you should answer their question with more questions!

  • How will it need to be accessed? (sequentially, randomly, some predictable distribution?)
  • Is the order of elements important?
  • Will the size of elements change?
  • How important is insert/remove performance?

The data structure you choose will depend on what kinds of trade-offs you are willing to make.

For example, if you only ever need to iterate over the set sequentially, perhaps you could use a linked list, as it has relatively small storage overhead.

If instead you need random access, you might want to look into the following (there is a small sketch after the list):

  • Hash-tables (constant time lookup, but need a good hash function for the data)
  • Some kind of index / tree structure?
  • Caching! You probably won't be able to keep it all in memory, and even if you can, you will want to take advantage of data locality where possible.
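
For a concrete picture of the index idea, here is a minimal sketch in Python (the file name and keys are made up) of a single-machine version: one large append-only data file plus a hash index that maps each element key to its byte offset and length, so a lookup is one seek and one read instead of a scan. At this scale the index itself would have to be persisted and sharded across machines, but the principle is the same.

```python
import os

class OffsetIndex:
    """Append-only store: all values live in one big data file, and an
    in-memory index maps key -> (offset, length), so a read is one seek."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length); at scale this would be sharded and persisted
        open(path, "ab").close()  # make sure the data file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()        # the element starts at the current end of file
            f.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)           # jump straight to the element
            return f.read(length)

# hypothetical usage
store = OffsetIndex("elements.dat")
store.put("element-42", b"some payload")
print(store.get("element-42"))
```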

TL;DR: It's all problem-dependent. There are many alternatives.

This is essentially the same problem faced by file systems / databases.

OTHER TIPS

I would use some distributed form of B-tree. A B-tree can store huge amounts of data with very good access times (the tree is usually not very deep, but very broad). Thanks to this property, it is used for indexing in relational databases. It also won't be very difficult to distribute it among many nodes (computers).

I think this answer should be sufficient for an interview...
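
To make the "broad, shallow, and easy to distribute" point concrete, here is a minimal sketch with made-up node names and split keys. It shows the two levels involved: a routing table that range-partitions keys across nodes, and a sorted per-node index standing in for the real on-disk B-tree a database would use.

```python
import bisect

# Hypothetical range partitioning: split points and node names are made up.
SPLIT_KEYS = ["g", "p"]            # keys < "g" -> node-a, < "p" -> node-b, else node-c
NODES = ["node-a", "node-b", "node-c"]

def node_for_key(key):
    """Route a key to the node that owns its range (the top level of the distributed tree)."""
    return NODES[bisect.bisect_right(SPLIT_KEYS, key)]

class NodeIndex:
    """Stand-in for one node's B-tree: keys kept sorted, with binary-search lookup.
    A real B-tree keeps the same sorted order, but in wide pages on disk."""
    def __init__(self):
        self.keys = []
        self.values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.values.insert(i, value)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

# hypothetical usage: route the key to its node, then search within that node
indexes = {name: NodeIndex() for name in NODES}
indexes[node_for_key("zebra")].insert("zebra", b"payload")
print(node_for_key("zebra"), indexes[node_for_key("zebra")].lookup("zebra"))
```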

The easiest and lowest-cost (at least until you massively scale up) option would be to use an existing service like Amazon S3.
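
As a rough sketch of that option (the bucket and object keys below are made up), storing and fetching one element with the AWS SDK for Python (boto3) looks something like this; replication, durability and load distribution become the service's problem rather than yours. For elements near the upper end of the stated size range you would use multipart uploads (for example via upload_file) rather than a single put_object call.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-element-store"   # hypothetical bucket name

# store one element under a key derived from its identifier
s3.put_object(Bucket=BUCKET, Key="elements/000000042", Body=b"element payload")

# fetch it back
response = s3.get_object(Bucket=BUCKET, Key="elements/000000042")
payload = response["Body"].read()
```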

Well, I would use a DHT and split each element into 8 MB chunks. Then keep a table with the file hash (e.g. SHA-256), the filename, and the chunk list.

The chunks would be stored on 3 different NAS servers. Have 1200 TB NAS servers and load balancer(s) to fetch whichever of the 3 copies is most convenient at the time.
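
A minimal sketch of the chunking side of that idea, assuming SHA-256 for the hashes and a made-up file name: split a file into 8 MB chunks, hash each chunk, and build a manifest mapping the whole-file hash to the ordered chunk hashes. Those chunk hashes would then be the DHT keys used to locate the 3 replicas.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024   # 8 MB chunks, as in the answer above

def build_manifest(path):
    """Return a manifest for one file: its name, whole-file hash,
    and the ordered list of chunk hashes (the keys used in the DHT)."""
    file_hash = hashlib.sha256()
    chunk_hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            file_hash.update(chunk)
            chunk_hashes.append(hashlib.sha256(chunk).hexdigest())
    return {
        "filename": path,
        "filehash": file_hash.hexdigest(),
        "chunks": chunk_hashes,   # each chunk hash resolves to its 3 replicas in the DHT
    }

# hypothetical usage
manifest = build_manifest("some_large_element.bin")
print(manifest["filehash"], len(manifest["chunks"]))
```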

Licensed under: CC-BY-SA with attribution