Question

I'm new to coding and I don't know what what I'm trying to do is called, and I want to find out so I can do more research and not re-invent the wheel. I searched for things such as "data consolidation" or "combine input", but the results are either too vague, or if I add the name of a language (e.g. JavaScript) I just get irrelevant libraries.

What I'm trying to do is to consolidate data from different sources:
- I have different data sources of things (let's say organizations)
- I'm trying to consolidate those sources so that I have 1 unique representation of each thing
- Sources might have different values for each thing (e.g. sourceA might have the organization name as "My Funny App" and sourceB might have "MyFunny")
- So I need some logic to consolidate the data (i.e. which source to keep?)

I'm not asking for the consolidation rules themselves, as I understand they are really use-case specific, and part of the work can be done through normalization (normalized values are more likely to match, so fewer conflicts). However, I'm thinking that anybody doing this type of work would have to perform some common tasks:

ASSIGNING VALUES:
- add missing values
- for conflicting/existing values, try to resolve
- resolve based on some criteria that define a score for each value
- pick the value with the highest score

DEBUGGING:
- keep track of all values (original/normalized) and their sources/scores
- have a log showing how each thing is modified as it goes through each source (see the rough sketch after this list)
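
To make this concrete, here is a rough Python sketch of the kind of logic I mean for both tasks; the scoring rule and the field names are just placeholders I made up:

```python
# Rough sketch: pick the highest-scored value while keeping every candidate
# (value, source, score) around for debugging. The scoring rule is a placeholder.
def consolidate(candidates, score):
    """candidates: list of (value, source) pairs; score: callable -> number."""
    history = []   # all values with their sources/scores, for the audit log
    best = None
    for value, source in candidates:
        s = score(value, source)
        history.append({"value": value, "source": source, "score": s})
        if best is None or s > best["score"]:
            best = {"value": value, "source": source, "score": s}
    return best, history

# Example: prefer longer names and trust sourceA slightly more.
best, history = consolidate(
    [("My Funny App", "sourceA"), ("MyFunny", "sourceB")],
    score=lambda value, source: len(value) + (1 if source == "sourceA" else 0),
)
print(best)     # {'value': 'My Funny App', 'source': 'sourceA', 'score': 13}
print(history)  # full trail of every value/source/score
```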

Does what I'm trying to do have a specific name? Are there any design patterns to do this more efficiently?


Solution

The data consolidation described in the question can be handled at a pretty large scale by using semantic web / linked data and the related RDF store technology.

The main idea is to add the original data, with all desired metadata, from all sources to an RDF graph (or multiple graphs) acting as a "data lake". After that, the distilled ("output") data can be materialized into the same graph by a rule engine and/or inference engine, given ontologies for the data sources and for the desired output data (something of this kind is needed in one form or another whenever the sources are disparate). Alternatively, the output can be calculated on the fly from the original data by queries (it depends on the specific case).
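
As a minimal illustration (not taken from any particular reference), here is a sketch using Python's rdflib; the namespace, predicates and scores are invented, and each source's statement is stored as a "claim" node that carries its provenance:

```python
# Minimal sketch with rdflib: keep every source's statement about a thing,
# together with where it came from and a score. Predicates/values are invented.
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)

def add_claim(graph, subject, prop, value, source, score):
    """Record one source's statement, keeping its provenance and score."""
    claim = BNode()
    graph.add((subject, EX.hasClaim, claim))
    graph.add((claim, EX.property, prop))
    graph.add((claim, EX.value, Literal(value)))
    graph.add((claim, EX.source, Literal(source)))
    graph.add((claim, EX.score, Literal(score, datatype=XSD.decimal)))

org = EX.org_42  # one "thing" to consolidate

# Both sources' original values stay in the graph, side by side.
add_claim(g, org, EX.name, "My Funny App", "sourceA", 0.9)
add_claim(g, org, EX.name, "MyFunny",      "sourceB", 0.6)
```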

All the cases mentioned (missing values, conflicting values, keeping track of all values, etc.) are covered by this way of doing ETL (extract, transform, load).

One virtue of semantic technology is that it uses the Open World Assumption and the "anyone can say anything about anything" (AAA) principle. This means the technology can be very flexible in resolving conflicts by using provenance information or the kind of scores and rules mentioned in the question.
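
For instance, staying with the rdflib sketch above (still with invented predicates), a SPARQL query can pick the winning value by score while the full result set doubles as the debugging/audit trail:

```python
# Resolve the conflicting names by taking the highest-scored claim; the rest of
# the result set serves as the audit trail of all values, sources and scores.
resolve = """
    PREFIX ex: <http://example.org/>
    SELECT ?value ?source ?score WHERE {
        ex:org_42 ex:hasClaim ?claim .
        ?claim ex:property ex:name ;
               ex:value ?value ;
               ex:source ?source ;
               ex:score ?score .
    }
    ORDER BY DESC(?score)
"""
rows = list(g.query(resolve))
print("picked:", rows[0]["value"], "from", rows[0]["source"])
for row in rows:  # every candidate with its provenance and score
    print("candidate:", row["value"], row["source"], row["score"])
```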

A brief explanation can be found, for example, in these slides by David Booth, Ph.D.

One more term (in addition to data consolidation and ETL) for what you are seeking, technologically, is Enterprise Information Integration: it faces similar challenges.

The question does not specify what "efficiently" means here: whether it is the speed of the process or the efficiency of aligning the data sources. Semantic technology tends to be better on the aligning side, though modern RDF stores and graph databases are quite efficient and capable.

Please note that this is just one possible technology. The choice of technology may be influenced by a lot of nuances, not least of which is what the development team is most familiar with.

OTHER TIPS

As far as I understand, you need a level of abstraction.

This abstraction is a unique representation of each thing.

Then, for each source, you will need a concretization of this abstraction.

That way you will be able to group different data under the same behavior and treat all of it in the same manner.
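
For example, a rough Python sketch of this idea (all class and field names are invented) could use one canonical type as the abstraction and one concretization per source that maps its raw records onto it:

```python
# Sketch: one canonical representation plus one concretization per source.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Organization:
    """The unique, source-independent representation of a 'thing'."""
    name: str
    source: str

class OrganizationSource(Protocol):
    """What every per-source concretization must provide."""
    def organizations(self) -> list["Organization"]: ...

class SourceA:
    def __init__(self, rows: list[dict]):
        self.rows = rows
    def organizations(self) -> list[Organization]:
        return [Organization(name=r["org_name"], source="sourceA") for r in self.rows]

class SourceB:
    def __init__(self, rows: list[dict]):
        self.rows = rows
    def organizations(self) -> list[Organization]:
        return [Organization(name=r["title"], source="sourceB") for r in self.rows]

# All sources can now be treated in exactly the same manner.
sources: list[OrganizationSource] = [
    SourceA([{"org_name": "My Funny App"}]),
    SourceB([{"title": "MyFunny"}]),
]
all_orgs = [org for s in sources for org in s.organizations()]
```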

If you need a more concrete answer, please provide more context for your problem.

Licensed under: CC-BY-SA with attribution