Scalable architecture for many users and large amounts of data (MSMQ?)

https://stackoverflow.com/questions/9396348

29-10-2019

Question

I'm designing a system in which multiple users will upload large amounts of data. My initial example is 100 users uploading 100 MB each every day.

I need to get the data, insert it into a database, process the data in the database (ETL) and then use the "polished" data for analysis.

The uploaded files will be received in chunks of 65k (initial design).
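To put rough numbers on that load, here is a small back-of-the-envelope calculation. It assumes decimal units (1 MB = 1,000,000 bytes, one "65k" chunk = 65,000 bytes), which the question does not specify, so treat the results as an order-of-magnitude estimate only.

```python
import math

# Assumptions (not stated in the question): decimal units, i.e.
# 1 MB = 1,000,000 bytes and a "65k" chunk = 65,000 bytes.
USERS = 100
UPLOAD_BYTES = 100 * 1_000_000
CHUNK_BYTES = 65_000

# The last chunk of an upload may be partial, hence the ceiling.
chunks_per_upload = math.ceil(UPLOAD_BYTES / CHUNK_BYTES)
chunks_per_day = USERS * chunks_per_upload
bytes_per_day = USERS * UPLOAD_BYTES

# Average rate if uploads were spread evenly over 24 hours (they rarely are).
avg_bytes_per_second = bytes_per_day / (24 * 60 * 60)

print(f"chunks per upload: {chunks_per_upload}")
print(f"chunks per day:    {chunks_per_day}")
print(f"bytes per day:     {bytes_per_day}")
print(f"average bytes/s:   {avg_bytes_per_second:.0f}")
```

The average rate is modest; what matters for the design is the peak, when many users upload at the same time.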

To avoid bottlenecks, I'm thinking of building this on MSMQ: I put the data into the message queue and then pass it on to different "programs/tools" that process the data and, in turn, signal the ETL tool via MSMQ to start doing its thing.
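A minimal sketch of that queued pipeline, for illustration only. Python's in-process queue.Queue stands in for MSMQ here, and the function names (receive_chunk, storage_worker, etl_worker) are made up for this example; a real deployment would use durable queues and a real database.

```python
import queue
import threading

# queue.Queue is an in-process stand-in for MSMQ (assumption for this sketch).
chunk_queue = queue.Queue()   # upload endpoint -> storage worker
etl_queue = queue.Queue()     # storage worker -> ETL worker
STOP = object()               # sentinel that tells a worker to shut down


def receive_chunk(upload_id, index, total, payload):
    """Called by the upload endpoint; returns as soon as the chunk is queued."""
    chunk_queue.put((upload_id, index, total, payload))


def storage_worker(store):
    """Stores chunks and signals the ETL stage once an upload is complete."""
    while True:
        item = chunk_queue.get()
        if item is STOP:
            etl_queue.put(STOP)   # propagate shutdown downstream
            break
        upload_id, index, total, payload = item
        chunks = store.setdefault(upload_id, {})
        chunks[index] = payload
        if len(chunks) == total:
            etl_queue.put(upload_id)


def etl_worker(store, results):
    """Reassembles a finished upload; the byte count is a placeholder for ETL."""
    while True:
        upload_id = etl_queue.get()
        if upload_id is STOP:
            break
        chunks = store[upload_id]
        data = b"".join(chunks[i] for i in range(len(chunks)))
        results[upload_id] = len(data)


store, results = {}, {}
workers = [
    threading.Thread(target=storage_worker, args=(store,)),
    threading.Thread(target=etl_worker, args=(store, results)),
]
for worker in workers:
    worker.start()

# Chunks may arrive out of order; the uploader never waits for the ETL.
receive_chunk("u1", 1, 2, b"world")
receive_chunk("u1", 0, 2, b"hello ")
chunk_queue.put(STOP)
for worker in workers:
    worker.join()
print(results)
```

The point of the design is that receive_chunk returns immediately, so slow storage or ETL work backs up in the queues instead of blocking the client.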

Alternatively, I'm thinking of a "linear" approach:

--> receive data 
--> save data to sql 
--> wait for upload finish (run the two above until no more chunks)
--> signal the ETL to do its thing
--> When ETL is done report "Done" to callee

Which approach seems to be the better one? Are there any alternatives to look into? The ambition is to have several thousand users... As far as I can see, the linear approach locks the client/downloader.
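For comparison, the linear approach could look roughly like the sketch below. It uses an in-memory SQLite database purely as a stand-in for the real SQL server, and the table names, handle_upload and run_etl are hypothetical. Note that the caller gets nothing back until the ETL step has finished.

```python
import sqlite3


def run_etl(conn, upload_id):
    """Placeholder ETL step: records the total size of the upload."""
    (size,) = conn.execute(
        "SELECT COALESCE(SUM(LENGTH(payload)), 0) FROM chunks WHERE upload_id = ?",
        (upload_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO polished (upload_id, size) VALUES (?, ?)",
        (upload_id, size),
    )
    conn.commit()


def handle_upload(conn, upload_id, chunks):
    """Receive -> save -> wait for the last chunk -> run ETL -> report 'Done'."""
    for index, payload in enumerate(chunks):   # loops until no more chunks
        conn.execute(
            "INSERT INTO chunks (upload_id, idx, payload) VALUES (?, ?, ?)",
            (upload_id, index, payload),
        )
    conn.commit()
    run_etl(conn, upload_id)   # the client is still waiting at this point
    return "Done"


# SQLite in memory is an assumption of this sketch, not the real target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (upload_id TEXT, idx INTEGER, payload BLOB)")
conn.execute("CREATE TABLE polished (upload_id TEXT, size INTEGER)")

status = handle_upload(conn, "u1", [b"hello ", b"world"])
polished = conn.execute("SELECT upload_id, size FROM polished").fetchall()
print(status, polished)
```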

Solution

I prefer the first approach. Its advantage over the second is that you can send and process the MSMQ messages asynchronously and make them transactionally safe with very little effort.

Not that the second approach would not work, but the first looks like much less effort to me.

I also suggest looking at some of the frameworks that sit on top of MSMQ. As a C# programmer I can recommend NServiceBus, but I do not know what you might be using.

OTHER TIPS

I suggest that after you've received the data you sort it according to the most frequently used index of the target table. You should do this in RAM, and you can either sort it 100 MB at a time or sort all 100 × 100 MB (it's only 10 GB of RAM) in one big sort. That way the block insertion will be faster (the indexing component will have less to do) and subsequent selects will find related rows bunched together (physically next to each other on disk) rather than randomly spread across the table. This results in fewer physical reads for a given select, thereby improving execution times.
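A small sketch of the sort-then-insert idea, again using in-memory SQLite as a stand-in. The readings table, its columns and the index are invented for this example, and how much physical clustering you actually gain depends on the database engine and on whether the index is the table's clustered index.

```python
import sqlite3

# Hypothetical target table with one frequently used index on (device_id, ts).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id INTEGER, ts INTEGER, value REAL)")
conn.execute("CREATE INDEX ix_readings_device_ts ON readings (device_id, ts)")

# Rows as they arrived from an upload, in no particular order.
rows = [(3, 20, 1.5), (1, 10, 0.2), (2, 15, 9.9), (1, 5, 0.1)]

# Sort in RAM by the index key so that index entries are appended in order
# and related rows end up next to each other.
rows.sort(key=lambda r: (r[0], r[1]))

# One transaction for the whole batch; "with conn" commits on success.
with conn:
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

stored = conn.execute(
    "SELECT device_id, ts FROM readings ORDER BY rowid"
).fetchall()
print(stored)
```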

Licensed under: CC-BY-SA with attribution
