Question

We have a project that loads metric data from various ETLs and services into an AWS SQS queue, where that data is processed and inserted into a metrics database running on AWS RDS.

When we initially designed the database schema, we chose Postgres and enforced things like referential integrity via foreign keys. We're running into problems with this approach, because there's no guarantee that a record in table A will exist before the corresponding record in table B is processed by the SQS Lambda.

We get lots of foreign key violation errors that cause records to be moved into the DLQ. This in turn causes other queue messages that depend on those records to fail as well.

Because of the distributed nature of how all this data is collected, should we drop all foreign keys in our Postgres tables? Was going the RDBMS route even worth it in this particular scenario? I don't think it's possible for us to change architecture to a NoSQL option, given how deep into the project we are, so are there any other things you can suggest that we could try to bridge the gap between eventual consistency and a relational database?

I should also mention that the reason the team chose the RDBMS approach is that the data needs to be loaded into a BI system where queries using joins can be done in typical SQL fashion. So the end goal is to have relational data for reporting and analytics, but the data is collected in a distributed, eventually consistent manner.

My example wasn't clear the first time, so I've updated the question again:

We have an Oracle DB that captures files received from SFTP. We don't have access to this database, but we have access to an API that talks to it.

One of our ETLs is responsible for grabbing today's files and moving them to some location.

Another ETL is responsible for parsing the files and extracting the contents of them so that they can be inserted into MongoDB as a report.

Another ETL is responsible for looking up these report numbers in Mongo, as well as the Oracle DB and returning the results to some consumer.

The AWS part comes in when we try to collect metrics on the performance of the ETLs themselves. We want to know how often a certain file is loaded, whether it's a duplicate, whether it has the wrong information on it, etc. The business wants to know what our data loads look like from a BI perspective.

So the AWS Postgres DB is our 'metrics' database. It captures events like a new file being processed: when it was processed, and so on. It duplicates some of the existing information in the Oracle table, but it's highly specific to our team and use case, since what's in Oracle is a general 'catch-all' for all files received via SFTP.

The report table captures things like - how often was this file loaded? Was an update made to it?

There are other tables that capture other kinds of info - how long does it take us to send a file to a third party, and receive that file back?

So because all of this information comes from multiple ETLs, as well as databases like Oracle and Mongo, the constraints cause a lot of headache.

The files received table has its own SQS queue and endpoint, as does the report metrics table. So the SQS messages are completely independent of each other, even though there is a relationship at the database level.

What we sometimes notice is that the files received table has not yet been populated by the time the report metrics table is being processed.

Another table tracks files we send to a third party for processing - this too also has a foreign key relationship to the files received table on the unique ID.

These metrics come from multiple ETLs that run at different times of day, with the possibility of a run failing.


Solution

So the foreign keys come from a source system (maybe an OLTP database) where they are immutable, and the metrics database (a kind of analytics DB) reuses them.

For such a situation, I would say it is perfectly fine to drop all FK constraints in the metrics database, or at least the constraints where you expect the eventual consistency issues. The keys are not generated in the target system, so there is no risk of getting unwanted duplicate keys.
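If you go this route, the constraints can be dropped with plain `ALTER TABLE ... DROP CONSTRAINT` statements. A minimal sketch of generating them, assuming hypothetical table and constraint names modeled on the question (they are not taken from any real schema):

```python
# Sketch: build ALTER TABLE statements to drop FK constraints from a
# metrics schema. Table/constraint names are hypothetical examples;
# in practice you would look up the real names (e.g. in pg_constraint)
# and run the statements through your usual migration tooling.

def drop_fk_statements(constraints):
    """constraints: iterable of (table, constraint_name) pairs."""
    return [
        f"ALTER TABLE {table} DROP CONSTRAINT {name};"
        for table, name in constraints
    ]

# Hypothetical constraints mirroring the tables described above
stmts = drop_fk_statements([
    ("report_metrics", "report_metrics_file_id_fkey"),
    ("third_party_files", "third_party_files_file_id_fkey"),
])
for stmt in stmts:
    print(stmt)
```

Running this only prints the statements; applying them is a one-way decision per constraint, so it is worth doing table by table rather than all at once.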

It is quite normal that some of the "best practices" you may be used to from OLTP databases don't apply to analytics databases. Analytics DBs can be less strictly constrained, and also more redundant than the typical OLTP database, since they are not the source of information, but contain only derived data. The data is not (inter)actively maintained in the analytics DB, but taken from a system which already guarantees a certain data quality, so there is usually no need to establish the same quality checks a second time.

Moreover, analytics databases are often used mainly for statistical purposes. So if only a few references are missing (between, for example, a report table and the files table) because of latency in your queueing system, while the majority of the data is available, then the missing references will probably fall into the category of statistical insignificance.

OTHER TIPS

It depends essentially on your business logic, on the use case that you need to implement.

If the use case first presents the list of files (1st table) to the user, the user chooses some file, and the application then looks up the metrics (in the 2nd table) to present them, then orphaned records in the 2nd table will never be requested. In such a use case, from the user's point of view, no problem at all will occur, and a foreign key would not provide any important benefit. You may still want to implement some job that periodically checks consistency and deletes very old (e.g. older than one day or one month) orphaned records in the 2nd table, just to avoid wasting storage space over the long term.
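Such a periodic cleanup job reduces to one question: which child rows have no parent and are old enough that the parent is clearly never coming? A minimal sketch, using in-memory stand-ins for the two tables (all names and fields are invented for illustration):

```python
from datetime import datetime, timedelta

def find_stale_orphans(parent_ids, child_rows, now, max_age):
    """Return ids of child rows whose parent is missing AND that are
    older than max_age. Recent orphans are kept, since their parent
    may simply still be in flight through the queue."""
    cutoff = now - max_age
    return [
        row["id"]
        for row in child_rows
        if row["file_id"] not in parent_ids and row["created_at"] < cutoff
    ]

now = datetime(2023, 1, 31)
parents = {"F1", "F2"}  # ids present in the files received table
children = [
    {"id": 1, "file_id": "F1", "created_at": now - timedelta(days=40)},   # has parent: keep
    {"id": 2, "file_id": "F9", "created_at": now - timedelta(days=40)},   # old orphan: delete
    {"id": 3, "file_id": "F9", "created_at": now - timedelta(hours=1)},   # recent orphan: keep
]
print(find_stale_orphans(parents, children, now, timedelta(days=30)))  # [2]
```

In a real deployment the same logic is a single anti-join DELETE with an age filter, run on a schedule; the grace period (`max_age`) is what keeps the cleanup from racing against queue latency.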

If the use case needs only the metrics (the 2nd table), e.g. to compute some statistics or trends, and the 1st table is not very important and is used, for instance, just for debugging purposes, then again a foreign key is not necessary.

If the use case absolutely requires that every metrics record (2nd table) refer to some file (1st table), then a foreign key may be helpful. Example: an order refers to a product that does not exist; saving such an order and processing it further usually makes no sense, so the order should not be saved, and a foreign key enforces exactly that.
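If the constraint itself has been dropped but a particular reference really is mandatory, the same check can be done in application code at insert time. A minimal in-memory sketch of the order/product example above (all names are invented; a real implementation would run this check inside the inserting transaction):

```python
class MissingParentError(Exception):
    """Raised when a row refers to a parent that does not exist,
    mimicking what the foreign key would have rejected."""

def insert_order(orders, products, order):
    if order["product_id"] not in products:
        raise MissingParentError(order["product_id"])
    orders.append(order)

products = {"P1"}  # existing product ids
orders = []

insert_order(orders, products, {"id": 1, "product_id": "P1"})  # accepted
try:
    insert_order(orders, products, {"id": 2, "product_id": "P404"})
except MissingParentError:
    pass  # rejected, as the FK would have done
print(len(orders))  # 1
```

Unlike a database constraint this check is not race-free across concurrent writers, which is precisely the trade-off being discussed: keep the FK where the reference is truly mandatory, and drop it where eventual consistency makes it more trouble than it is worth.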

Licensed under: CC-BY-SA with attribution