Question

Let us say I want to transfer data from one MongoDB cluster with 50 million records to another, where the self-imposed 'schema' has changed drastically, and I want to test the import + conversion before actually running it.

I am able to find a list of distinct fields just fine, but I want to pull a variety of documents so that each distinct field is represented. This data would then be the source for testing my Map-Reduce script.

The issue arose from many years of use and changes in how the data was stored. What was originally user.orgId became user.organizationid.

Any suggestions? Even on 3rd party tools?


Solution

Basically it seems like you have two related questions:

  1. How can I run an import and conversion without affecting the final collection?
  2. How can I verify that the documents in a collection match a particular schema definition?

Both questions have a variety of appropriate answers.

For question 1:

a. You can create a temporary duplicate of your cluster, then run your import and conversion in that environment. This is the safest way.
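If you don't want to clone the whole cluster with the usual tooling (mongodump/mongorestore), one lightweight way to approximate this is to copy the source collection into a separate staging database and run the test there. Here is a minimal sketch of that idea, assuming pymongo; the URIs, database names, and collection names are placeholders, not anything from your setup:

```python
# Minimal sketch: copy a collection into a staging environment for test runs.
# All connection strings and names below are hypothetical placeholders.
from pymongo import MongoClient

SOURCE_URI = "mongodb://source-host:27017"    # placeholder
STAGING_URI = "mongodb://staging-host:27017"  # placeholder

source = MongoClient(SOURCE_URI)["prod_db"]["users"]
staging = MongoClient(STAGING_URI)["staging_db"]["users"]

batch = []
for doc in source.find():
    batch.append(doc)
    if len(batch) == 1000:
        staging.insert_many(batch)  # copy in batches to limit memory use
        batch = []
if batch:
    staging.insert_many(batch)
```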

b. You can simply run the import and conversion with a different final collection. This isn't as safe as (a), because it requires the developer to be diligent about selecting the appropriate collections at test time and at final deployment time.
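As an illustration of option (b), here is a minimal sketch that applies the orgId → organizationid rename from your question and writes the result into a separately named test collection, so the real target is never touched. It assumes pymongo; the database name, collection names, and the conversion logic are placeholders for your actual import + conversion:

```python
# Minimal sketch: run the conversion against a test target collection.
# Names and the convert() logic are hypothetical placeholders.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["prod_db"]  # placeholder URI/db

TARGET = "users_converted_test"  # test target; switch to the real name only at deploy time

def convert(doc):
    """Apply the schema change: rename user.orgId to user.organizationid."""
    user = doc.get("user", {})
    if "orgId" in user:
        user["organizationid"] = user.pop("orgId")
    return doc

db[TARGET].drop()  # start from a clean test collection
for doc in db["users"].find():
    db[TARGET].insert_one(convert(doc))
```

The only thing that has to change at deployment time is the TARGET name, which is exactly where the diligence mentioned above comes in.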

For question 2:

This depends very much on the environment you are developing for, which I don't know anything about. But, for the sake of an example, if you were working in Python, you could use something like https://pypi.python.org/pypi/jsonschema and iterate over each document, confirming that it conforms to the schema you require. If you already have an ODM in place, and have mappings that describe the schema, it should be possible to validate documents using the mapping.
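For instance, a minimal sketch of that approach with jsonschema might look like the following; the schema shown only checks for the new user.organizationid field, and the connection and collection names are placeholders:

```python
# Minimal sketch: validate each document in a collection against a JSON Schema.
# The schema, URI, and collection names are hypothetical placeholders.
from pymongo import MongoClient
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "user": {
            "type": "object",
            "properties": {"organizationid": {"type": "string"}},
            "required": ["organizationid"],
        },
    },
    "required": ["user"],
}

coll = MongoClient("mongodb://localhost:27017")["staging_db"]["users_converted_test"]

bad = 0
for doc in coll.find({}, {"_id": 0}):  # exclude _id from validation
    try:
        validate(instance=doc, schema=schema)
    except ValidationError as err:
        bad += 1
        print(err.message)
print(f"{bad} documents failed validation")
```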

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow