Efficient data importing?

https://stackoverflow.com/questions/2608453

25-09-2019
|

Question

We work with a lot of real estate, and while rearchitecting how the data is imported, I came across an interesting issue.

Firstly, the way our system works (loosely speaking) is we run a Coldfusion process once a day that retrieves data provided from an IDX vendor via FTP. They push the data to us. Whatever they send us is what we get.

Over the years, this has proven to be rather unstable.

I am rearchitecting it with PHP on the RETS standard, which uses SOAP methods of retrieving data, which is already proven to be much better than what we had.

When it comes to 'updating' existing data, my initial thought was to query only for data that was updated. There is a field for 'Modified' that tells you when a listing was last updated, and the code I have will grab any listing updated within the last 6 hours (give myself a window in case something goes wrong).

However, I see a lot of real estate developers suggest creating 'batch' processes that run through all listings regardless of updated status that is constantly running.

Is this the better way to do it? Or am I fine with just grabbing the data I know I need? It doesn't make a lot of sense to me to do more processing than necessary. Thoughts?

Solution

If you can trust the MODIFIED, you approach is good enough and faster than parsing the whole file. If you are looping the whole find trying to find what is modified or not it will be harder.

There will a slight performance improvement if you can send the file directly to the Database and let it import with its built in file import functions. For example MySQL has this - http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html I am sure other DB like SQL Server and Oracle have such mechanisms too. This will save you the overhead of the programming language work and only use the Database resources.

OTHER TIPS

If you are lucky enough to have a last updated date, then it is certainly more efficient to limit the number of rows you are updating or importing. Just be sure that the date is actually being updated by the data source when records change as not all databases are well-designed and don't always do things the way you would expect.

And whatever you do, don't process row-by-row, that is the most inefficent way to do things. I can do a batch insert of a million records faster than I can process a data set of 100000 records one row at a time. But there are set-based ways to do the type if insert you are talking about and a set-based update of the 100 records that changed is going to be faster than a batch insert of 50,000 records.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow