Question

A colleague has challenged me with an interesting question. To be honest, I have no idea how to deal with it.

Suppose the following:

Every 5 minutes you get a new file with one hundred thousand new records. You need to store the records in a database table within these 5 minutes.

  • First, I'd stream that file, because loading it all into memory risks exhausting the heap (an out-of-memory condition, not a stack overflow)
  • Second, I'd insert the data in batches (see the sketch after this list)
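
Roughly what I have in mind, as a minimal sketch (Python's built-in sqlite3 stands in for whatever driver the real database uses; the two-column records table is hypothetical):

```python
# A minimal sketch: stream the file line by line and flush inserts in
# fixed-size batches. sqlite3 stands in for whatever DB-API driver the
# real database uses; the two-column records table is hypothetical.
import csv
import sqlite3

BATCH_SIZE = 10_000

def load_file(path: str, conn: sqlite3.Connection) -> None:
    batch = []
    with open(path, newline="") as f:
        for row in csv.reader(f):   # reads one line at a time, never the whole file
            batch.append(row)
            if len(batch) == BATCH_SIZE:
                conn.executemany("INSERT INTO records VALUES (?, ?)", batch)
                conn.commit()
                batch.clear()
    if batch:                       # flush the final partial batch
        conn.executemany("INSERT INTO records VALUES (?, ?)", batch)
        conn.commit()
```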

What coding strategy would you use to cope with the amount of data and to stay within this short timeframe for each file you receive?

Solution

Bulk Insert Operations are your friend(s).

If it's a straight insert only (the files contain new data only), simply bulk insert the data directly into the table. Most databases have utilities for bulk insert operations, and some also expose that functionality through client libraries, in case command-line utilities and batch files seem old-fashioned.
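
For instance, SQL Server ships the bcp command-line utility, which can load a file straight into a table; a minimal sketch of driving it from code (the server, database, table, and file layout are hypothetical placeholders):

```python
# A minimal sketch, assuming SQL Server and its bcp command-line utility.
# The server, database, table, and file names are hypothetical.
import subprocess

def bulk_load(data_file: str) -> None:
    # bcp streams the file straight into the table; -b commits in
    # 10,000-row batches rather than one huge transaction.
    subprocess.run(
        [
            "bcp", "IngestDb.dbo.IncomingRecords", "in", data_file,
            "-S", "localhost",   # target server
            "-T",                # trusted (Windows) authentication
            "-c",                # character-mode data file
            "-t,",               # comma field terminator
            "-b", "10000",       # rows per committed batch
        ],
        check=True,
    )

bulk_load("records_202401010800.csv")
```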

If it's an insert/update/delete scenario, bulk insert into a staging table and then use RDBMS-specific functionality to update the target table.

For example, SQL Server provides a nice MERGE command to merge data into a target table.
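
A minimal sketch of the staging-then-merge step (assuming SQL Server via pyodbc; the DSN, the Staging and Records tables, and the RecordId/Payload columns are all hypothetical):

```python
# A minimal sketch, assuming SQL Server via pyodbc; the DSN, the Staging
# and Records tables, and the RecordId/Payload columns are hypothetical.
import pyodbc

MERGE_SQL = """
MERGE dbo.Records AS target
USING dbo.Staging AS source
    ON target.RecordId = source.RecordId
WHEN MATCHED THEN
    UPDATE SET target.Payload = source.Payload
WHEN NOT MATCHED BY TARGET THEN
    INSERT (RecordId, Payload) VALUES (source.RecordId, source.Payload)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;  -- keep this clause only if absence from the file means delete
"""

conn = pyodbc.connect("DSN=IngestDb")
cursor = conn.cursor()
cursor.execute(MERGE_SQL)                      # one set-based statement covers all three cases
cursor.execute("TRUNCATE TABLE dbo.Staging")   # empty staging for the next file
conn.commit()
conn.close()
```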

Bulk inserting 100,000 records will only take a second or so. If you're in the insert/update/delete scenario, it will take a few more seconds to merge the data from the staging table into the main table.

With this approach you will be able to meet the 5-minute window.

If this amount of data really is coming in every 5 minutes, then you will also need a data-partitioning strategy to help manage the data in the database.

60 / 5 = 12 files per hour; 12 × 24 = 288 files per day.

288 × 100,000 = 28,800,000, roughly 29 million records a day, or about 870 million records per month.
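
At that volume, partitioning by a date column keeps each day's data in its own slice that can be managed independently. A minimal sketch, assuming SQL Server; the function/scheme names and boundary dates are hypothetical:

```python
# A minimal sketch of date-based partitioning, assuming SQL Server; the
# function/scheme names and boundary dates are hypothetical, and a
# scheduled job would add new boundaries as days roll over.
import pyodbc

PARTITION_DDL = """
CREATE PARTITION FUNCTION pf_records_by_day (datetime2)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-01-02', '2024-01-03');

CREATE PARTITION SCHEME ps_records_by_day
    AS PARTITION pf_records_by_day ALL TO ([PRIMARY]);

-- the table is then created ON ps_records_by_day(LoadedAt)
"""

conn = pyodbc.connect("DSN=IngestDb", autocommit=True)  # DDL outside a transaction
conn.execute(PARTITION_DDL)
conn.close()
```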

I would develop an archiving/purge strategy as well.
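
The purge can run off the same date column. A minimal sketch of a batched delete, again assuming SQL Server via pyodbc with hypothetical names, so each transaction stays short:

```python
# A minimal sketch of a batched purge, assuming SQL Server via pyodbc and
# a hypothetical dbo.Records table with a LoadedAt column. Small batches
# keep each transaction, and the log it generates, short.
import pyodbc

PURGE_SQL = """
DELETE TOP (50000) FROM dbo.Records
WHERE LoadedAt < DATEADD(DAY, -90, SYSUTCDATETIME());
"""

conn = pyodbc.connect("DSN=IngestDb")
cursor = conn.cursor()
while True:
    cursor.execute(PURGE_SQL)
    conn.commit()
    if cursor.rowcount == 0:   # nothing older than 90 days is left
        break
conn.close()
```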

OTHER TIPS

My coding strategy would necessarily have to consider how/when you are going to read those records.

Otherwise, if you have no read requirements, I wouldn't even put the records into a database; instead I'd just leave them in the files (or even throw them away).

Your read requirements combined with your ingestion requirements will dictate what kind of database to use, for one.

The question is rather broad and leaves a lot of information unstated.

I would also consider the schema of those records, such as the number of tables and the types of primary keys.

You mention the records are new rather than updates of existing ones, so a NoSQL database might handle this nicely, depending again on your unstated read requirements.

Licensed under: CC-BY-SA with attribution