Temporary storage for keeping data between program iterations? [closed]

https://stackoverflow.com/questions/4399610

10-10-2019

Question

I am working on an application that works like this:

1. It fetches data from many sources, resulting in a pool of about 500,000-1,500,000 records (depending on the time and day).
2. The data is parsed.
3. Part of the data is compared against pre-existing data (read from the database), calculations are made, and the results are stored in the database. The resulting data set that has to be stored is, however, much smaller than the original one, ranging from 5,000 to 50,000 records. This step almost always updates existing data and perhaps adds a few more records.
4. The data from step 2 should then be kept somewhere, so that the next time data is fetched there is a data set that can be used to perform the calculations without touching the pre-existing data in the database. I should point out that this data can be lost; it's not irreplaceable (key information can be read from the database if needed), but keeping it would speed up the process next time.

Application components can (and will) run on different computers in the same network, so the storage has to be reachable from multiple hosts.

I have considered using memcached, but I'm not sure whether I should: one record is usually no smaller than 200 bytes, so 1,500,000 records would amount to at least 300 MB of memcached cache. That doesn't seem scalable to me. What if the data were 5x that amount? What if it consumed 1-2 GB of cache just to keep data between iterations (which could easily happen)?
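
As a rough check of that estimate, here is a minimal sketch of the arithmetic. It counts raw payload only and ignores keys and any per-item overhead the cache adds, so real usage would be higher. The function name `payload_mb` is just an illustrative helper.

```python
def payload_mb(records: int, bytes_per_record: int) -> float:
    """Raw payload size in megabytes (10**6 bytes), ignoring keys and cache overhead."""
    return records * bytes_per_record / 1_000_000


# Worst case from the question: 1,500,000 records of at least 200 bytes each.
print(payload_mb(1_500_000, 200))      # 300.0

# The "5x that amount" scenario.
print(payload_mb(5 * 1_500_000, 200))  # 1500.0
```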

So, the question is: which temporary storage mechanism would be most suitable for this kind of processing? I haven't considered using MySQL temporary tables, as I'm not sure whether they can persist between sessions and be used by other hosts in the network. Any other suggestions? Something I should consider?

Solution

I know this sounds very old-school, but a temp file on your SAN would be easy and cheap.

Loading a 300 MB file at the start of each run is trivial compared to consuming 300 MB of cache all the time.

And if you can recreate it from the database keys, it would be wise to write and test that part, and to make it automatic: if the temp file is unavailable, the info is mined from the keys and recreated.
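
The load-or-recreate idea could look roughly like the sketch below. It is only an illustration under several assumptions: the records fit in memory as a dictionary, JSON is an acceptable file format, the shared mount handles a rename within one directory atomically, and `rebuild` is a hypothetical callback standing in for whatever query recreates the data from the database keys. The names `load_or_rebuild` and `save_atomically` are made up for this example.

```python
import json
import os
import tempfile
from typing import Callable, Dict

Records = Dict[str, dict]


def load_or_rebuild(cache_path: str, rebuild: Callable[[], Records]) -> Records:
    """Return the records from the temp file, or rebuild them if it is missing or unreadable."""
    try:
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, ValueError):
        # The temp file is disposable: a missing or corrupt file just means
        # we fall back to the slower path (hypothetical database lookup).
        return rebuild()


def save_atomically(cache_path: str, records: Records) -> None:
    """Write the records to a temporary name, then rename over the real file."""
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(records, fh)
        # Readers on other hosts should see either the old file or the new one,
        # assuming the shared filesystem supports atomic rename.
        os.replace(tmp_path, cache_path)
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Each run would call `load_or_rebuild` at the start and `save_atomically` once parsing is done. Writing to a temporary name first means a run that crashes mid-write leaves the previous file in place instead of a truncated one.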

Licensed under: CC-BY-SA with attribution
