Question

I am loading a large amount of data (2,000 files with about 30,000 rows, 30 columns) for use in a Django application.

The files are based on 40 different templates. The column names are consistent within a template but not between templates. We've manually mapped the columns to canonical names for the fields we are currently using, and we plan to store the rest, unmapped, in JSON for potential future use. For example, in one template the price of an item might be called price_per_unit, while in another it's the_unit_price; we map both to unit_price.
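In code, the mapping is essentially this (the template keys and column names below are invented for illustration, not our real templates):

```python
# Illustrative per-template column maps; the template keys and source
# column names are placeholders, not our real templates.
COLUMN_MAPS = {
    "template_a": {"price_per_unit": "unit_price", "item_no": "item_id"},
    "template_b": {"the_unit_price": "unit_price", "sku": "item_id"},
}

def split_row(template: str, row: dict) -> tuple[dict, dict]:
    """Return (canonical fields, leftover fields to keep as JSON)."""
    mapping = COLUMN_MAPS[template]
    mapped = {canon: row[src] for src, canon in mapping.items() if src in row}
    extra = {k: v for k, v in row.items() if k not in mapping}
    return mapped, extra
```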

Our current process is to load the Excel files into a Postgres JSON field (bulk inserting with Django), pull them from there into pandas to normalize the columns, and then write them into a couple of Postgres tables (one holds the date and item_id, the other describes the item).
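The pandas step amounts to roughly the following (a sketch; raw_rows stands for the JSON rows already loaded into Postgres, and unit_price is just an example column):

```python
import pandas as pd

def normalize_template(raw_rows: list[dict], column_map: dict) -> pd.DataFrame:
    """Turn one template's raw JSON rows into a frame with canonical columns."""
    df = pd.DataFrame(raw_rows)
    df = df.rename(columns=column_map)  # template-specific names -> canonical names
    df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
    return df
```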

I'm using Django's bulk insert, but because of that I create duplicate items which I then have to go back and remove. Inserting line by line instead (to ensure I don't create duplicates as I go) takes too long in Django (over 5 minutes per file).
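For concreteness, the two approaches look roughly like this (the model and mapped_rows are placeholders, and the model is assumed to be keyed on item_id):

```python
def load_fast(model, mapped_rows):
    # Fast but duplicate-prone: one INSERT per batch, no duplicate check.
    model.objects.bulk_create([model(**f) for f in mapped_rows], batch_size=1000)

def load_safe_but_slow(model, mapped_rows):
    # Duplicate-safe but slow: at least one query per row.
    for f in mapped_rows:
        model.objects.get_or_create(item_id=f["item_id"], defaults=f)
```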

Should I stick with cleaning up duplicates afterward? I'm concerned that the cleanup will scale with the overall size of our database, so it works now but may not later.


Solution

The usual strategy is to bulk insert into a different table (a "staging" table) and then merge into the destination. An additional advantage is that you aren't storing columnar data as JSON blobs, but as entries in the staging tables.
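A minimal sketch of what the merge could look like from Django, assuming a staging table item_staging, a destination table item, and a unique constraint on item_id (all placeholder names):

```python
from django.db import connection

MERGE_SQL = """
    -- DO NOTHING skips rows that already exist; DO UPDATE could refresh them instead.
    INSERT INTO item (item_id, unit_price, description)
    SELECT item_id, unit_price, description
    FROM item_staging
    ON CONFLICT (item_id) DO NOTHING;
"""

def merge_staging():
    """Bulk-load into item_staging first, then merge and clear the staging table."""
    with connection.cursor() as cursor:
        cursor.execute(MERGE_SQL)
        cursor.execute("TRUNCATE item_staging;")
```

The duplicate check then happens inside Postgres as an index lookup per incoming row rather than as a separate round trip from Django, so it stays cheap as the destination table grows.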

I'd be tempted to have a staging table for each template, and stored procedures to do the mapping.
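For example, one such procedure might fold the column mapping into the merge itself (again a sketch: staging_template_a and its columns are invented, ON CONFLICT assumes a unique constraint on item_id, and CREATE PROCEDURE needs Postgres 11+):

```python
from django.db import connection

# Created once per template; the body maps that template's own column names
# (price_per_unit, item_no, ...) onto the canonical columns while merging.
CREATE_MERGE_TEMPLATE_A = """
    CREATE OR REPLACE PROCEDURE merge_template_a()
    LANGUAGE sql
    AS $$
        INSERT INTO item (item_id, unit_price)
        SELECT item_no, price_per_unit
        FROM staging_template_a
        ON CONFLICT (item_id) DO NOTHING;
    $$;
"""

def run_merge_template_a():
    with connection.cursor() as cursor:
        cursor.execute(CREATE_MERGE_TEMPLATE_A)
        cursor.execute("CALL merge_template_a();")
```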

Licensed under: CC-BY-SA with attribution