From Oracle to Apache Parquet: how to handle eventual consistency?

https://softwareengineering.stackexchange.com/questions/404580

07-03-2021

Question

I have an existing production Oracle database. However, certain kinds of operations have performance issues because of the volume of data or the complexity of the queries.

That's why I regularly export/dump each Oracle table to a CSV file. These CSV files are then converted to Parquet files to allow very high-performance queries with Spark. However, my concern is about losing the benefits of strong consistency.

Suppose there are two tables in Oracle like these:

data (id, value, fk_metadata_types_id)

metadata_types (id, label)

As of now, I regularly export these two tables, then convert them to Parquet files (each Oracle table has its own set of Parquet files) so that they are ready for Spark queries.

The problem is consistency. There are two batches: one that dumps the data tables to CSV (then Parquet), and another that dumps the metadata tables to CSV.

So it can happen that, at a given time, Spark reads the data table with an fk_metadata_types_id that does not yet exist in the corresponding metadata Parquet files.

How can I handle such consistency issues? The idea is to have performant queries with Spark, but also to guarantee that whenever the data is queried by Spark, it is always possible (strong consistency) to get the corresponding metadata_types with a join, just like a join in Oracle.

Thanks

Solution

On the front end (the source database), most databases offer a snapshot isolation level that lets you run multiple commands against the same database state. This means that all transactions that completed before yours started remain visible, and the changes of all transactions begun after yours remain invisible. If you run all the exports inside one such transaction, referential integrity should be preserved.
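As a minimal sketch of the single-transaction export, the snippet below dumps both tables from the question inside one read transaction. It uses Python's built-in sqlite3 module purely as a stand-in for Oracle so that it runs anywhere; the function name `export_snapshot`, the sample rows, and the choice to return CSV text instead of writing files are all assumptions made for illustration. On Oracle, the transaction would typically be opened with something like `SET TRANSACTION READ ONLY` before the first query, which is an assumption about your setup rather than a tested recipe.

```python
import csv
import io
import sqlite3

# Table names taken from the question; metadata first, then data.
TABLES = ["metadata_types", "data"]


def export_snapshot(conn, tables, begin_sql="BEGIN"):
    """Dump every table inside ONE transaction; returns {table: csv_text}.

    Table names are trusted constants here, so plain string formatting is
    acceptable for this sketch.
    """
    dumps = {}
    cur = conn.cursor()
    # Assumption: on Oracle this would be e.g. "SET TRANSACTION READ ONLY".
    cur.execute(begin_sql)
    try:
        for table in tables:
            cur.execute(f"SELECT * FROM {table}")
            buf = io.StringIO()
            writer = csv.writer(buf)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur.fetchall())
            dumps[table] = buf.getvalue()
    finally:
        cur.execute("COMMIT")  # end the read transaction
    return dumps


# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute("CREATE TABLE metadata_types (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute(
    "CREATE TABLE data (id INTEGER PRIMARY KEY, value REAL, "
    "fk_metadata_types_id INTEGER REFERENCES metadata_types(id))"
)
conn.execute("INSERT INTO metadata_types VALUES (1, 'temperature')")
conn.execute("INSERT INTO data VALUES (1, 10.5, 1)")

dumps = export_snapshot(conn, TABLES)
```

The point of the design is that one job exports all related tables, instead of two independent batches that each see a different database state. The CSV text for each table can then go through the existing CSV-to-Parquet conversion unchanged.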

On the back end, in ETL terms, this problem is known as a late-arriving dimension. There are several strategies, such as holding back the incomplete records or using temporary values. With the latter, the label would read, for example, future_label_labelid and would be updated in the next run.
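The two strategies just mentioned can be sketched in plain Python on small in-memory rows. The function names `split_late_rows` and `add_placeholders`, the dict-based row layout, and the sample values are assumptions made for illustration. In Spark itself the same idea would more likely be expressed with joins (for instance a left anti join to find the rows whose metadata is missing), which is not shown here.

```python
def split_late_rows(data_rows, metadata_rows):
    """Strategy 1: hold back data rows whose metadata has not arrived yet."""
    known_ids = {m["id"] for m in metadata_rows}
    complete = [r for r in data_rows if r["fk_metadata_types_id"] in known_ids]
    held_back = [r for r in data_rows if r["fk_metadata_types_id"] not in known_ids]
    return complete, held_back


def add_placeholders(data_rows, metadata_rows):
    """Strategy 2: add a temporary label for every missing metadata id."""
    known_ids = {m["id"] for m in metadata_rows}
    missing = sorted({r["fk_metadata_types_id"] for r in data_rows} - known_ids)
    placeholders = [{"id": i, "label": f"future_label_{i}"} for i in missing]
    return metadata_rows + placeholders


# Hypothetical sample rows using the column names from the question.
data = [
    {"id": 1, "value": 10.0, "fk_metadata_types_id": 1},
    {"id": 2, "value": 20.0, "fk_metadata_types_id": 2},  # metadata not exported yet
]
metadata_types = [{"id": 1, "label": "temperature"}]

complete, held_back = split_late_rows(data, metadata_types)
patched_metadata = add_placeholders(data, metadata_types)
```

With the first strategy, the held-back rows are published in a later run once their metadata has arrived. With the second, every join succeeds immediately, and the placeholder label is overwritten when the next metadata export brings the real one.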

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange