Question

I have a large CSV file (>1 GB) sitting on network file storage which gets updated weekly with new records. The file has columns similar to these:

Customer ID | Product | Online? (Bool) | Amount | Date

I need to use this file to update a PostgreSQL database of customer IDs with the total amount per month by product and store. Something like this:

Customer ID | Month | (several unrelated fields) | Product 1 (Online) | Product 1 (Offline) | Product 2 (Online) | etc.

Because the file is so large (and getting steadily larger with each update), I need an efficient way to grab the updated records and update the database. Unfortunately, our server updates the file by Customer ID and not by date, so I can't just tail it.

Is there a clever way to diff the file in a way that won't break as the file keeps growing?


Solution

COPY the file into a staging table. This assumes, of course, that you have a PK, i.e. a unique identifier for each row that doesn't mutate. Checksum the remaining columns, do the same for the rows you already loaded into your destination table, and compare source to destination; this will find the updates, deletes, and new rows.
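
A minimal sketch of the staging load, assuming a hypothetical staging schema, a staging.weekly_sales table matching the CSV from the question, and a made-up file path; with a client-side file you would use psql's \copy instead:

create schema staging;

create table staging.weekly_sales (
    customer_id integer,
    product     text,
    online      boolean,
    amount      numeric,
    sale_date   date
);

-- Server-side COPY; the file must be readable by the PostgreSQL server process.
COPY staging.weekly_sales
FROM '/mnt/filestore/current_week_sales.csv'
WITH (FORMAT csv, HEADER true);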

As you can see, I haven't added any indexes or tuned this in any other way; my goal was simply to make it function correctly.

create schema source;
create schema destination;

--DROP TABLE source.employee; 
--DROP TABLE destination.employee;

-- Build 10 million identical rows on both sides so source and destination start in sync.
select x as employee_id, cast('Bob' as text) as first_name, cast('H' as text) as last_name, cast(21 as integer) as age
INTO source.employee
from generate_series(1,10000000) x;

select x as employee_id, cast('Bob' as text) as first_name, cast('H' as text) as last_name, cast(21 as integer) as age
INTO destination.employee
from generate_series(1,10000000) x;

-- Diff query: FULL OUTER JOIN on the key, then compare an md5 checksum of the remaining columns.
-- 'CHECKSUM' = row changed, 'Missing' = row in source but not destination (new),
-- 'Orphan'   = row in destination but not source (deleted).
select
destination.employee.*,
source.employee.*,
CASE WHEN md5(source.employee.first_name || source.employee.last_name || source.employee.age) != md5(destination.employee.first_name || destination.employee.last_name || destination.employee.age) THEN 'CHECKSUM'
     WHEN destination.employee.employee_id IS NULL THEN 'Missing'
     WHEN source.employee.employee_id IS NULL THEN 'Orphan' END AS AuditFailureType
FROM destination.employee
FULL OUTER JOIN source.employee
             on destination.employee.employee_id = source.employee.employee_id
WHERE (destination.employee.employee_id IS NULL OR source.employee.employee_id IS NULL)
   OR md5(source.employee.first_name || source.employee.last_name || source.employee.age) != md5(destination.employee.first_name || destination.employee.last_name || destination.employee.age);

--Mimic source data getting an update.
UPDATE source.employee
SET age = 99
where employee_id = 45000;

-- Re-run the same diff: employee_id 45000 now surfaces with AuditFailureType = 'CHECKSUM'.
select
destination.employee.*,
source.employee.*,
CASE WHEN md5(source.employee.first_name || source.employee.last_name || source.employee.age) != md5(destination.employee.first_name || destination.employee.last_name || destination.employee.age) THEN 'CHECKSUM'
     WHEN destination.employee.employee_id IS NULL THEN 'Missing'
     WHEN source.employee.employee_id IS NULL THEN 'Orphan' END AS AuditFailureType
FROM destination.employee
FULL OUTER JOIN source.employee
             on destination.employee.employee_id = source.employee.employee_id
WHERE (destination.employee.employee_id IS NULL OR source.employee.employee_id IS NULL)
   OR md5(source.employee.first_name || source.employee.last_name || source.employee.age) != md5(destination.employee.first_name || destination.employee.last_name || destination.employee.age);
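
Once the differences are identified, applying them back is three set-based statements (or a single MERGE on PostgreSQL 15+). A minimal sketch against the same source/destination tables; whether you actually want the delete step depends on your workflow:

-- New rows: in source but not yet in destination.
INSERT INTO destination.employee (employee_id, first_name, last_name, age)
SELECT s.employee_id, s.first_name, s.last_name, s.age
FROM source.employee s
LEFT JOIN destination.employee d ON d.employee_id = s.employee_id
WHERE d.employee_id IS NULL;

-- Changed rows: checksum mismatch.
UPDATE destination.employee d
SET first_name = s.first_name,
    last_name  = s.last_name,
    age        = s.age
FROM source.employee s
WHERE s.employee_id = d.employee_id
  AND md5(s.first_name || s.last_name || s.age) != md5(d.first_name || d.last_name || d.age);

-- Orphans: in destination but no longer in source.
DELETE FROM destination.employee d
WHERE NOT EXISTS (SELECT 1 FROM source.employee s WHERE s.employee_id = d.employee_id);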

OTHER TIPS

Don't store data in a CSV that is > 1 gigabyte. Store it in a file called something like current_week_sales. At the end of the week, schedule a script which renames it to something like 2014_12_sales and creates a new, empty current_week_sales.

The only truly efficient solution is to get control of the program which is creating that file, and make it do something more sensible instead.

If you can't do that, > 1 GB just isn't that big unless it is >> 1 GB. Just recompute the whole thing. If that is slow, then make it faster. There is no reason that computing a few summaries over 1 GB of data should be slow.
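
For scale, a full recompute over the staging table sketched earlier is a single aggregate. The product names and column layout here are assumptions, with FILTER clauses standing in for the Product 1 (Online) / Product 1 (Offline) pivot from the question:

SELECT customer_id,
       date_trunc('month', sale_date) AS month,
       sum(amount) FILTER (WHERE product = 'Product 1' AND online)     AS product_1_online,
       sum(amount) FILTER (WHERE product = 'Product 1' AND NOT online) AS product_1_offline,
       sum(amount) FILTER (WHERE product = 'Product 2' AND online)     AS product_2_online
FROM staging.weekly_sales
GROUP BY customer_id, date_trunc('month', sale_date);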

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow