What's the proper approach to multi-threaded database diffing?

https://stackoverflow.com/questions/23573061

19-07-2023

Question

My company obtains a Microsoft Access database from Cerner Multum, which needs to be diffed against our production backend, which is Sybase (12.0.1.3924). And while I'm aware of off-the-shelf database diff tools (http://www.diffkit.org/, http://www.liquibase.org/), none seems to fit my need - as such, I decided to write a Java tool to perform the work as proof-of-concept.

As it stands, the tool is currently working as designed, and here's the procedure:

1. Obtain a list of tables to be diffed from a config file
2. Establish a connection to both backends
3. Ensure the tables in the config file can be matched against both MS Access and Sybase
4. If so, proceed with the diff:
- For each table:
  - Obtain a row from MS Access, instantiate an object via reflection
  - Iterate over each column in the row, stuffing data into the newly created "Access" POJO
  - Using the Access POJO, construct a query for Sybase
  - Query Sybase:
    - If the result set is NULL, insert a record into Sybase
    - If the result set is not NULL, instantiate another POJO and stuff the Sybase data into it
    - Compare the two POJOs:
      - If the POJOs match: do nothing, move on to the next row
      - If the POJOs do not match: update Sybase using the data from the Access POJO
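
The per-row decision in the procedure above can be sketched as a small pure function. This is only an illustration: the original tool's classes are not shown, so `RowDiff`, `Action`, and the use of a `Map` in place of the reflection-built POJOs are all assumptions.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical names; a Map of column name -> value stands in for the POJOs.
final class RowDiff {
    enum Action { INSERT, UPDATE, SKIP }

    // accessRow: the row read from MS Access.
    // sybaseRow: the matching Sybase row, or null when the lookup found nothing.
    static Action decide(Map<String, Object> accessRow, Map<String, Object> sybaseRow) {
        if (sybaseRow == null) {
            return Action.INSERT;            // no counterpart in Sybase
        }
        // Map.equals compares keys and values, i.e. column by column.
        return Objects.equals(accessRow, sybaseRow) ? Action.SKIP : Action.UPDATE;
    }
}
```

Keeping this step free of database calls makes it easy to test on its own. In practice the two drivers may return different Java types for the same column, so values might need normalizing before they are compared.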

Now as stated, this is currently getting the job done, albeit in a very procedural, single-threaded manner, and therein lies my question: what is the proper approach to diffing two databases (that happen to be unrelated) in a multi-threaded manner?

I have some experience with multi-threading, but am unsure as to the correct approach as I've never queued inserts/updates. That said, I'm not entirely certain queuing is the proper approach - what about bulk updates/inserts?

Would someone with some experience in this area offer some high-level insight as to how to approach this problem? As it stands, I'm churning through 1.5 million rows in about 2 hours, which is roughly 200 TPS. Very slow. Any guidance would be greatly appreciated, and I'd be happy to offer additional information if necessary.

Answer

Having a tool that does this kind of work correctly is, in my experience, very valuable. It may be slow, but if it is fast enough, changing it to make it faster is not worth the risk of incorrect results.

Having said that, the current diff procedure for each table lends itself well to multi-threading. The procedure probably loses most of its time to network latency when communicating with the (Sybase) database that might need to be updated. Having a couple of threads do this in parallel will help the throughput.
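
As a rough back-of-the-envelope check using the numbers from the question, 1.5 million rows in 2 hours works out to about 4.8 ms per row. The sketch below assumes that this cost is almost entirely round-trip latency and that the database scales linearly with the number of writer threads; both are assumptions, not measurements, and `ThroughputEstimate` is a hypothetical helper.

```java
final class ThroughputEstimate {
    // Estimated run time in minutes when `threads` workers share the rows and
    // the per-row cost is dominated by latency (an optimistic assumption).
    static double estimatedMinutes(long rows, double secondsPerRow, int threads) {
        return rows * secondsPerRow / threads / 60.0;
    }

    public static void main(String[] args) {
        long rows = 1_500_000L;
        double secondsPerRow = 7200.0 / rows;   // 2 hours spread over all rows
        for (int threads : new int[] {1, 4, 8}) {
            System.out.printf("%d thread(s): ~%.0f minutes%n",
                    threads, estimatedMinutes(rows, secondsPerRow, threads));
        }
    }
}
```

Real gains will likely be smaller once the database or the network becomes the bottleneck.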

Let one thread read the records of a table from the input (MS Access) database and put the Access POJOs in a concurrent queue (e.g. ConcurrentLinkedQueue). Let a number of threads take the Access POJOs from this queue and execute the update procedure in parallel.
When there are no more records in the table, let the read thread put special "end of table" Access POJOs in the queue, one per update thread, so that the update threads know when to stop. The read thread also needs to pause when the queue gets too large; a bounded ArrayBlockingQueue does this automatically, because put() blocks while the queue is full.
Repeat for the next table.
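
A minimal sketch of the read thread and update threads described above, using a bounded ArrayBlockingQueue and one "end of table" marker per update thread. `AccessRow`, `TableDiffPipeline`, and the `diffAndUpdate` callback are hypothetical stand-ins for the existing POJO and the existing compare-and-update code, and the calling thread plays the role of the read thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical stand-in for the Access POJO.
final class AccessRow {
    static final AccessRow END_OF_TABLE = new AccessRow(-1, null); // sentinel
    final int id;
    final String value;
    AccessRow(int id, String value) { this.id = id; this.value = value; }
}

final class TableDiffPipeline {
    // Diffs one table: the caller reads rows, `writers` threads process them.
    static void run(Iterable<AccessRow> source, Consumer<AccessRow> diffAndUpdate,
                    int writers, int queueCapacity) {
        BlockingQueue<AccessRow> queue = new ArrayBlockingQueue<>(queueCapacity);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < writers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        AccessRow row = queue.take();               // blocks while empty
                        if (row == AccessRow.END_OF_TABLE) return;  // stop this writer
                        diffAndUpdate.accept(row);                  // existing compare/update code
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        try {
            for (AccessRow row : source) queue.put(row);            // blocks while full
            for (int i = 0; i < writers; i++) queue.put(AccessRow.END_OF_TABLE);
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while diffing table", e);
        }
    }
}
```

The sketch leaves out error handling: if `diffAndUpdate` throws, that writer thread dies, and if every writer dies the reader can block forever on a full queue. Each writer would likely also need its own Sybase connection rather than sharing one.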

The idea here is that the current source code is moved without being altered too much, which minimizes the risk of breaking things. The read thread gets a Runnable containing the current code for reading from the MS Access database and creating an Access POJO, run in a loop; the write threads get a Runnable containing the current code for comparing and updating the Sybase database.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow