Question

I have several PostgreSQL databases at remote sites: two in China, one in India, one in Korea, one in Germany, one in France, and one in Mexico.

All of the databases, regardless of site, have one table that is part of a schema. That table is updated from an Excel spreadsheet, which is filled in by a person.

Joe will be updating the table in Korea through his spreadsheet, Jane will be updating the table in India through hers, and Dustin and Ahmed will be updating the table in China through theirs. And so on for the French, German, and Mexican sites.

We want these 7 databases to replicate their content into the main corporate database. The amount of data is tiny, about 1 MB per day, but it arrives as a constant stream, hour by hour and sometimes minute by minute, e.g. 10 KB at 8 AM, 200 KB at 9 AM, and so on.

We would like to transfer the data into the corporate database as soon as Dustin has filled in the Excel spreadsheet in China. As you can guess, we only need to copy the data from the remote sites to the corporate site, not the other way around.

And last but not least, none of the sites (China, India, Korea, or the others) currently has a VPN (IPsec) connection to the corporate network.

  • Is PGQ/Londiste an appropriate replication solution for PostgreSQL given the (tiny) amount of data we have?
  • Would copying the table from the local site to a cloud database such as RDS, and then copying it to the corporate database, be a sound idea? It might be easier to set up, but it feels redundant, even though it would probably save us from setting up an IPsec route.
  • If not, what other solution can we use?
  • And the last question: should we set up IPsec between the corporate firewall and the site firewalls to allow replication?

Thanks

Solution

My suggestion is that you implement a service layer between your humans and each site's database. Host a central service in your main corporate network. Have the humans upload their spreadsheets to the service, and let the service control how the data gets into the database.
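To make that concrete, here is a minimal sketch of such an upload service, assuming Python with Flask, openpyxl, and psycopg2. The endpoint name, the spreadsheet layout, the connection string, and the `site_data` table are all placeholders for illustration, not details from your environment:

```python
# Minimal sketch of the upload service (hypothetical names and schema).
# The table "site_data(site, recorded_at, value)" is an illustrative placeholder.
import io

import openpyxl
import psycopg2
from flask import Flask, abort, request

app = Flask(__name__)
CORPORATE_DSN = "host=corp-db dbname=corp user=ingest"  # placeholder DSN

@app.route("/upload/<site>", methods=["POST"])
def upload(site: str):
    # The human posts their .xlsx file; only the service touches the database.
    file = request.files.get("sheet")
    if file is None:
        abort(400, "missing spreadsheet")

    workbook = openpyxl.load_workbook(io.BytesIO(file.read()), read_only=True)
    rows = list(workbook.active.iter_rows(min_row=2, values_only=True))  # skip header

    # Validate / clean before anything reaches the database.
    cleaned = [(site, recorded_at, value)
               for recorded_at, value in rows
               if value is not None]

    with psycopg2.connect(CORPORATE_DSN) as conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO site_data (site, recorded_at, value) VALUES (%s, %s, %s)",
            cleaned,
        )
    return {"inserted": len(cleaned)}
```

Joe, Jane, and Dustin then just post their files over HTTPS (e.g. with curl or a small desktop helper) instead of ever touching a database connection.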

  • This approach gives you a central point of control over the data.
  • You don't have to deal with data replication; the service can simply insert each row into both the central corporate database and the offsite database (see the sketch after this list).
  • It will give you control of the format of the data, let you change the schema, validate, control access, etc.
  • It has better longevity. In my experience, automated replication schemes are hard to maintain and monitor, and they do not handle duplicate data or merging well.
  • You won't (necessarily) need to establish a VPN. You can simply expose the service over TLS with an appropriately secure authentication and authorization scheme.
  • The window during which data is out of sync will be very short (milliseconds), whereas a replication scheme usually has a constant latency (e.g. replicating every 5 minutes means a 5-minute window during which the data is out of sync).
  • You can remove human access to the databases, so you don't have to worry that one of your humans will delete all your data. It may also mean you need fewer licenses, saving costs (not relevant for PostgreSQL).
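If you do want the rows mirrored into the originating site's own database as well, the service can simply perform both inserts itself. A rough sketch, again with placeholder connection strings and the same illustrative `site_data` table:

```python
# Sketch of the dual write mentioned above (placeholder DSNs and schema).
import psycopg2

CORPORATE_DSN = "host=corp-db dbname=corp user=ingest"       # placeholder
SITE_DSNS = {"korea": "host=kr-db dbname=site user=ingest"}  # placeholder

INSERT_SQL = "INSERT INTO site_data (site, recorded_at, value) VALUES (%s, %s, %s)"

def store_rows(site: str, rows: list) -> None:
    # Write to the corporate database first; it is the system of record.
    with psycopg2.connect(CORPORATE_DSN) as conn, conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)

    # Then mirror the same rows to the site's own database.
    with psycopg2.connect(SITE_DSNS[site]) as conn, conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
```

Since the two writes are separate transactions, a failure between them can leave the copies briefly inconsistent; at your data volume, queueing failed site writes for retry is usually enough to deal with that.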

I previously worked at a company where we implemented a kind of automated replication between many MySQL databases and a large SQL Server database, much like what you're describing. It was very cumbersome, and we eventually replaced it with a service-oriented approach that worked much better.

Licensed under: CC-BY-SA with attribution