Question

I am working on a web front end plus front-end services.

I receive good-sized CSV files (10k lines each). My service processes them and condenses them into one larger CSV file (up to 300k lines).

This larger file will be turned into an HTML/PDF report after some extrapolation.

My questions are:

  1. Taking 17,000 files and turning them into one takes FOREVER (18 hours last time I tried it). The current process is to take a line of the CSV, parse it to see if it exists in my master array, and either create a new entry or add the data to an existing entry in the array (a simplified sketch follows these questions). Is there a better way to do this? It seems each item takes longer to process than the one before it.

  2. Once this large file is created, parsing it seems to take quite a while as well. Should I move away from writing to a CSV output and go with JSON for speed of data massaging? Or even a lightweight database?
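
Roughly, the merge logic from question 1 looks like this (simplified; "key" and "value" stand in for my real columns):

```python
import csv

# Simplified sketch of the current merge; "key" and "value" stand in
# for the real column names.
master = []  # list of dicts, one per condensed entry

def merge_file(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Linear scan of the master list for every input row, so the
            # work per line grows with the current size of master.
            existing = next((m for m in master if m["key"] == row["key"]), None)
            if existing is None:
                master.append({"key": row["key"], "value": float(row["value"])})
            else:
                existing["value"] += float(row["value"])
```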


Solution

I believe you are trying to recreate the concept of a database management system the hard way. File I/O combined with parsing and re-parsing your data is what kills your performance.

Option 1: Handle the merging yourself

a) Put your master "array" into a database as a set of rows in (one or more) table(s).

b) Read in your files, and merge the results into the tables.
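
A minimal sketch of this option, assuming SQLite and placeholder columns "key" and "value"; the upsert lets the database make the "new entry or add to existing" decision for each row:

```python
import csv
import glob
import sqlite3

# Minimal sketch of Option 1 using SQLite; "key" and "value" are
# placeholder columns -- adapt to the real schema.
conn = sqlite3.connect("report.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS master (
        key   TEXT PRIMARY KEY,
        value REAL NOT NULL
    )
""")

for path in glob.glob("incoming/*.csv"):
    with open(path, newline="") as f:
        rows = [(r["key"], float(r["value"])) for r in csv.DictReader(f)]
    # Upsert: insert new keys, accumulate values for existing ones.
    # (ON CONFLICT ... DO UPDATE requires SQLite 3.24 or newer.)
    conn.executemany("""
        INSERT INTO master (key, value) VALUES (?, ?)
        ON CONFLICT(key) DO UPDATE SET value = value + excluded.value
    """, rows)
    conn.commit()
```

Because the key column is the primary key (and therefore indexed), each incoming row is matched in roughly constant time instead of by scanning the whole master collection.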

Option 2: Let the database handle the merging

a) Put your master "array" into one or more tables in a database (table A). Construct your indices.

b) Import each file into tables similar in form to the master tables, but separate and temporary (table B).

c) Merge the master and imported data to produce a temporary update table C containing the records that appear in both tables. (INNER JOIN)

d) Anti-merge the master and imported data by finding all imported records that are not in the master table and putting them into a temporary table D. (RIGHT EXCLUDING JOIN)

e) Perform an update from table C into table A. Then, add all records from table D to table A.
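
A rough sketch of this flow, again assuming SQLite and the same placeholder columns. Here the master table plays the role of table A and a temporary staging table the role of table B; the results of steps c and d are applied directly instead of being materialised as separate tables C and D:

```python
import csv
import sqlite3

# Rough sketch of Option 2 with SQLite; "key" and "value" are placeholders.
# master corresponds to table A, staging to table B.
conn = sqlite3.connect("report.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS master (key TEXT PRIMARY KEY, value REAL NOT NULL);
    CREATE TEMP TABLE staging (key TEXT PRIMARY KEY, value REAL NOT NULL);
""")

def import_file(path):
    conn.execute("DELETE FROM staging")
    with open(path, newline="") as f:
        conn.executemany(
            "INSERT INTO staging (key, value) VALUES (?, ?)",
            [(r["key"], float(r["value"])) for r in csv.DictReader(f)],
        )

    # Steps c and e: update master rows that also appear in the import
    # (the INNER JOIN case).
    conn.execute("""
        UPDATE master
        SET value = value + (SELECT s.value FROM staging s WHERE s.key = master.key)
        WHERE key IN (SELECT key FROM staging)
    """)

    # Steps d and e: insert imported rows that have no match in master
    # (the excluding-join case).
    conn.execute("""
        INSERT INTO master (key, value)
        SELECT s.key, s.value
        FROM staging s
        LEFT JOIN master m ON m.key = s.key
        WHERE m.key IS NULL
    """)
    conn.commit()
```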

(For an excellent view of the JOIN terminology, I use this set of diagrams and code: Visual Representation)

OTHER TIPS

Switching to JSON is not an obvious win: the format is more complex than CSV and is best suited to exchanging small, structured data, not bulk row data. To speed things up, first make sure the master collection is created with a realistic initial capacity for that many rows, because resizing large collections is very expensive. Second, keep the collection ordered, so that if the candidate row is greater than the last element, no further iteration is required (a sketch follows). This won't work with GUIDs, which have no meaningful ordering for this purpose.
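
If the merge does stay in memory, a sorted collection plus a binary search captures both of those tips; a minimal sketch with placeholder fields:

```python
import bisect

# Keep keys sorted so each lookup is a binary search rather than a full
# scan; values is a parallel list. "key" and "value" are placeholders.
keys, values = [], []

def merge_row(key, value):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] += value            # existing entry: accumulate
    else:
        keys.insert(i, key)           # new entry: keep the list ordered
        values.insert(i, value)
```

A plain dictionary keyed on the row identifier achieves much the same effect with even less code, at the cost of losing the sorted order.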

Licensed under: CC-BY-SA with attribution