Question

I receive daily from an external source a very large amount of data (around 250 GB, 260 million rows of fixed-width text) distributed over 5 text files. I am writing a Java application that should combine a first group of data (files 1-4) with a second group (file 5) based on some business logic.

But accessing/reading 250 GB of text files multiple times is pretty time-consuming. So I decided to find a more efficient way to process my data. I am thinking of storing the data in a database (for example MySQL Workbench) and doing the processing with the database instead of the text files. This database would be dropped after the processing is done.

Could this approach of using a temporary database improve performance compared to the text files? Or are there better suggestions for how to design this mass processing?

Note: my application has to run on Windows Server R2 with 32 GB of RAM, an Intel Xeon E5645 processor and a 1 TB hard disk.


Solution

It is difficult to give a simple answer without knowing how the first 4 files are related to each other, how the business logic combines the data, and whether any assumption can be made about the ordering of the files. Nevertheless, here are some general ideas to help you evaluate the approach you are considering.

Your data is fixed width, which makes it easy to parse, compare and convert, both for the file approach and for the database approach.
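For illustration, here is a minimal parsing sketch in Java; the 12-character key and the single payload field are only assumptions, since the actual record layout is not given:

    // Minimal fixed-width parsing sketch; the offsets and field names are hypothetical.
    public class FixedWidthRecord {
        final String key;     // e.g. columns 0-11: the business key used for matching
        final String payload; // remainder of the line, kept as-is

        FixedWidthRecord(String line) {
            this.key = line.substring(0, 12).trim();
            this.payload = line.substring(12);
        }

        public static void main(String[] args) {
            String line = "ABC123456789 the rest of the fixed-width record ...";
            FixedWidthRecord r = new FixedWidthRecord(line);
            System.out.println(r.key + " -> " + r.payload.length() + " payload characters");
        }
    }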

The database approach requires importing all the data before the processing can start. This means parsing and converting all the input fields of all the files. It also means building indexes for the fields that require fast lookups. Finally, it could mean additional overhead for transactional integrity management.

This overhead can be minimized:

  • If you have only a few indexes, the cost of building them should in principle be smaller than sorting the text files (because sorting text files requires several full rewrites of all the data).

  • The temporary database tables can be defined with only the fields that are relevant for the business logic, the remainder of each input text line being put into one large fixed-size text field (a sketch of such a minimal staging table follows this list). This reduces the conversion overhead (e.g. dates, numbers, ...) during the import to its bare minimum, in principle to the same level as converting the text in the files on your own. It also reduces internal database work when fetching rows (the more fields there are, the longer it takes to build the internal in-memory datasets).

  • Many databases have a bulk-load feature that allows you to temporarily disable transactional integrity during the import, thus further reducing one of the heavy upload tasks.
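To make the idea concrete, here is a rough sketch of such a minimal staging table created through JDBC, assuming MySQL; the table and column names, the key width and the payload width are all hypothetical:

    // Sketch of a minimal staging table, assuming MySQL accessed via JDBC.
    // Only the matching key gets its own column; the rest of each fixed-width
    // line goes into a single payload field.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateStagingTable {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/staging", "user", "password");
                 Statement st = con.createStatement()) {

                st.execute("CREATE TABLE group_a ("
                        + " match_key CHAR(12) NOT NULL,"   // field used by the business logic
                        + " payload   CHAR(200) NOT NULL,"  // remainder of the fixed-width line
                        + " KEY idx_match_key (match_key)"
                        + ") ENGINE=InnoDB");

                // Relax integrity checks for the duration of the bulk import.
                st.execute("SET unique_checks = 0");
                st.execute("SET foreign_key_checks = 0");
            }
        }
    }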

Database engines have features that can significantly accelerate the data processing:

  • The use of database indexes can avoid repeatedly reading big parts of the files just to locate a few records, and can significantly increase performance (except if the text files are already sorted on the same field). A sketch of the combining step as an indexed join follows this list.

  • In general, the query optimizer will automatically optimize queries (which would otherwise require a careful manual analysis).

  • Database caching algorithms are used to optimize access (especially repeated access).
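Here is a rough sketch of what that combining step could look like through JDBC, assuming the hypothetical staging tables group_a (files 1-4) and group_b (file 5) from above; the join condition stands in for your actual business logic:

    // Sketch of combining both groups with a single SQL join; the optimizer can
    // use the index on match_key instead of rescanning the raw files.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CombineGroups {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/staging", "user", "password");
                 Statement st = con.createStatement()) {

                // MySQL Connector/J convention to stream rows instead of buffering them all.
                st.setFetchSize(Integer.MIN_VALUE);

                try (ResultSet rs = st.executeQuery(
                        "SELECT a.match_key, a.payload, b.payload"
                        + " FROM group_a a JOIN group_b b ON b.match_key = a.match_key")) {
                    while (rs.next()) {
                        process(rs.getString(1), rs.getString(2), rs.getString(3));
                    }
                }
            }
        }

        private static void process(String key, String payloadA, String payloadB) {
            // placeholder for the actual combination logic
        }
    }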

Conclusion: Unless your text files are already sorted according to the criteria of the grouping logic, and unless you can find a single-pass algorithm to combine your data, there is a good chance that the database approach will outperform the raw text file approach.

Important remark: the heaviest and most delicate part of the database approach will be the import (especially on your older machine). Fortunately, you can assess the feasibility of this approach with very limited effort: define the database structure, use the SQL engine you are familiar with, and try the mysqlimport utility on a sample of the data.
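If mysqlimport turns out to be awkward for fixed-width lines, one way to run such a trial from Java is LOAD DATA LOCAL INFILE with the whole line loaded into a user variable and split with SUBSTRING; the path, offsets and timing harness below are just placeholders:

    // Rough feasibility trial: time the import of one sample file into the
    // hypothetical staging table, splitting each fixed-width line with SUBSTRING.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ImportTrial {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/staging?allowLoadLocalInfile=true",
                    "user", "password");
                 Statement st = con.createStatement()) {

                long start = System.currentTimeMillis();
                st.execute("LOAD DATA LOCAL INFILE 'C:/data/sample_file1.txt'"
                        + " INTO TABLE group_a"
                        + " (@line)"
                        + " SET match_key = TRIM(SUBSTRING(@line, 1, 12)),"
                        + "     payload   = SUBSTRING(@line, 13)");
                System.out.println("Sample import took "
                        + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }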

Other tips

You are a bit brief on the kind of processing you need to do on those text files. But most likely you do not want to use a relational database system as a processing tool. That would cost you a ton of extra memory, disk space and processing power/time. You want to touch that raw data no more often than you need to.

You may want to store intermediate results in a relational database system though, but that should already be interpreted data from your raw text files, mapped to the smallest possible codes. For example, if you have a COUNTRY field in your flat files and there are no more than about a hundred possible countries, you could map it to a single-byte code. Perhaps you know all valid countries upfront; then you can hard-code a dictionary in your program and determine the code as you read the file. See how the knowledge you have of the data you receive can help you optimize the processing.
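A minimal sketch of that dictionary idea, assuming the set of valid countries is known upfront; the country list here is purely illustrative:

    // Hard-coded dictionary mapping a textual COUNTRY field to a one-byte code.
    import java.util.HashMap;
    import java.util.Map;

    public class CountryCodes {
        private static final Map<String, Byte> CODES = new HashMap<>();
        static {
            String[] countries = { "BR", "DE", "FR", "NL", "US" }; // example values only
            for (byte i = 0; i < countries.length; i++) {
                CODES.put(countries[i], i);
            }
        }

        // Resolve the code while reading a line; unknown values are flagged early.
        static byte codeFor(String country) {
            Byte code = CODES.get(country.trim());
            if (code == null) {
                throw new IllegalArgumentException("Unknown country: " + country);
            }
            return code;
        }

        public static void main(String[] args) {
            System.out.println(codeFor("DE")); // prints 1
        }
    }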

Try to do as much as possible in a single pass, reading files 1-4 alternating with file 5, combining data as you go, if that is possible. Whatever you can ignore after reading over it once is a win.
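One way to approximate such a single pass, assuming the keys and data you need from file 5 fit in your 32 GB of RAM (all paths, offsets and file names below are made up):

    // Single-pass sketch: index the relevant part of file 5 in memory, then
    // stream files 1-4 once and combine as you go.
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class SinglePassCombine {
        public static void main(String[] args) throws IOException {
            Map<String, String> file5ByKey = new HashMap<>();

            // Pass over file 5 once, keeping only what the business logic needs.
            try (BufferedReader r = Files.newBufferedReader(Paths.get("C:/data/file5.txt"))) {
                String line;
                while ((line = r.readLine()) != null) {
                    file5ByKey.put(line.substring(0, 12).trim(), line.substring(12));
                }
            }

            // Stream files 1-4 once, combining each record with its file 5 counterpart.
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("C:/data/combined.txt"))) {
                for (int i = 1; i <= 4; i++) {
                    Path in = Paths.get("C:/data/file" + i + ".txt");
                    try (BufferedReader r = Files.newBufferedReader(in)) {
                        String line;
                        while ((line = r.readLine()) != null) {
                            String match = file5ByKey.get(line.substring(0, 12).trim());
                            if (match != null) {
                                out.write(line + match); // placeholder for the real combination
                                out.newLine();
                            }
                        }
                    }
                }
            }
        }
    }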

Then you may be able to work with the people providing the data. You will likely not be interested in all of it. They may be able to filter or condense things for you at the source. Now that would be agile!

Licensed under: CC-BY-SA with attribution