Question

If you use a schemaless database (particularly a document-oriented database such as CouchDB, Couchbase, or MongoDB) and want to change the format in which a particular kind of object is represented, you can leave existing records in the old format and create new records in the new format. This is often cited as one of the major advantages of schemaless databases (I think because you can avoid downtime). On the other hand, it's inconvenient and inefficient to deal with many formats of the same kind of data. So what are good approaches/strategies for migrating data from one format to another in schemaless databases?

Solution

As with everything, there are many different ways to handle this. In schemaless development, you are generally cognizant of the data you are storing. It's not that the schema is missing: all data has an implicit schema, so what we are really saying is that the database is not enforcing a schema. If I have a user object with 10 instance variables that I store in JSON, there IS a schema there! The situations you run into fall roughly into three cases:

Case 1: a value might take different forms: a single value, an array, or a nested structure

Case 2: a value needs to be changed from one format to another, e.g. from a single value to an array of values

Case 3: a JSON key may or may not exist; this is pretty straightforward

For Case 1: if you expect variety in a JSON value, handling that variety has to be written into your application logic: if it's a string, do this; if it's an array, do that.
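A minimal Python sketch of that kind of branching, assuming the "my_comment" key from the example document further down. The function name comments_as_list and the nested {"items": [...]} shape are hypothetical, chosen only for illustration:

```python
def comments_as_list(doc):
    """Return the comments of a document as a list, whatever shape is stored."""
    value = doc.get("my_comment")
    if value is None:
        # Key is absent (Case 3): treat it as "no comments".
        return []
    if isinstance(value, str):
        # Old format: a single string.
        return [value]
    if isinstance(value, list):
        # New format: already an array.
        return list(value)
    if isinstance(value, dict):
        # Hypothetical nested format: {"items": [...]}.
        return list(value.get("items", []))
    raise TypeError(f"unexpected type for my_comment: {type(value).__name__}")
```

The rest of the application can then call comments_as_list(doc) and never has to care which format a given document was written in.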

For Case 2: one approach is to handle this "On Request" or "On Demand": you bake the transformation logic into your class methods, so that data is transformed from the old format to the new one when it is retrieved. You can also flag the document to indicate that it has been transformed. Since it's On Demand, some data in your document store may remain untransformed, but if it does get requested, it will be transformed.
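As a rough sketch of the On Demand idea in Python: the store below is a plain dict standing in for the document database, and upgrade_user, load_user and CURRENT_VERSION are hypothetical names, not part of any Couchbase or MongoDB API. The "version" field serves as the flag that a document has been transformed, and the old value is kept under a separate key:

```python
CURRENT_VERSION = 1.01  # assumed target version, matching the example documents


def upgrade_user(doc):
    """Return (document, changed). The input document is not mutated."""
    if doc.get("version", 1.00) >= CURRENT_VERSION:
        return doc, False
    upgraded = dict(doc)
    comment = upgraded.get("my_comment")
    if isinstance(comment, str):
        # Keep the old value under a separate key, then switch to the array format.
        upgraded["my_comment_v1.00"] = comment
        upgraded["my_comment"] = [comment]
    upgraded["version"] = CURRENT_VERSION  # flag the document as transformed
    return upgraded, True


def load_user(store, key):
    """Read a document, upgrading and writing it back only if it was outdated."""
    doc, changed = upgrade_user(store[key])
    if changed:
        store[key] = doc
    return doc
```

Documents that are never read stay in the old format, which is exactly the trade-off described above.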

Alternative approach for Case 2: iterate through the data and transform it with worker processes. Rather than waiting for a document to be requested, you create a job that changes the data as you want, baking the transformation logic into the workers themselves (which can use the same class definitions as your app code). In Couchbase you can create a View (secondary index) or use Elasticsearch to iterate through documents of a particular type. If you create a workflow system, you can do a lot of this in parallel with many workers.
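A small Python sketch of the worker approach, under the same simplifying assumption that a dict stands in for the database. In a real system the list of keys would come from a View or a search index rather than a scan, and the workers would likely be separate processes or queue consumers; migrate_doc, migrate_all and TARGET_VERSION are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

TARGET_VERSION = 1.01  # assumed target version


def migrate_doc(doc):
    """Return a migrated copy: add the array under a new key, keep the old key."""
    migrated = dict(doc)
    comment = migrated.get("my_comment")
    if isinstance(comment, str):
        migrated["my_new_comment"] = [comment]
    migrated["version"] = TARGET_VERSION
    return migrated


def migrate_all(store, doc_type="user", workers=4):
    """Migrate every outdated document of one type; return the keys touched."""
    # Stand-in for "iterate through documents of a particular type".
    keys = [
        key
        for key, doc in store.items()
        if doc.get("type") == doc_type and doc.get("version", 1.00) < TARGET_VERSION
    ]

    def work(key):
        store[key] = migrate_doc(store[key])
        return key

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(work, keys))
```

Because migrate_doc is a pure function, the same logic could also be reused by the On Demand path, so both routes produce identical documents.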

When I do transformations, I generally transform one JSON k/v into another JSON k/v in a non-destructive way, so that if I have made an error in my process, I have not altered the original data. I can then have a later phase that removes the old JSON k/v "On Demand", if I even feel that is necessary. This is a safer approach to this type of operation.
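The later cleanup phase could look roughly like the following Python sketch. It assumes the old string was preserved under "my_comment_v1.00" and the new array lives under "my_comment"; the function name drop_legacy_comment is hypothetical. The old key is only dropped once the new value demonstrably contains the old data:

```python
def drop_legacy_comment(doc, legacy_key="my_comment_v1.00"):
    """Return a copy without the legacy key, but only if nothing would be lost."""
    if legacy_key not in doc:
        return doc
    old_value = doc[legacy_key]
    new_value = doc.get("my_comment")
    if not (isinstance(new_value, list) and old_value in new_value):
        # The transformation looks incomplete or wrong: keep the original data.
        return doc
    return {key: value for key, value in doc.items() if key != legacy_key}
```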

Appended

Case 1 & 2: Data Transformation

Original JSON Document

user::101
{
  "uid": 1234,
  "type": "user",
  "my_comment": "the quick brown fox jumped over the lazy dog",
  "version": 1.00
}

Now let's say I want to change it in a non-destructive way. I can simply add a new JSON key that holds the transformed data:

user::101
{
  "uid": 1234,
  "type": "user",
  "my_new_comment": ["the quick brown fox jumped over the lazy dog", "comment2"],
  "my_comment": "the quick brown fox jumped over the lazy dog",
  "version": 1.01
}

Notice that it's non-destructive: the old JSON key is still there. Alternatively, I can save the old data under a new key and change the expected JSON key to a new format (an array instead of a string):

user::101        
{ 
  "uid": 1234,
  "type": "user",
  "my_comment": ["the quick brown fox jumped over the lazy dog", "comment2"],
  "my_comment_v1.00": "the quick brown fox jumped over the lazy dog",
  "version": 1.01
}

Obviously there is quite a variety of schemes you could use, depending on your preferences.

Licensed under: CC-BY-SA with attribution