Question

I have been reading a lot about Apache Avro lately and I am inclined to use it instead of JSON. Currently, we serialize each JSON document using Jackson and write the serialized document into Cassandra, one per row key/user ID.

Then we have a REST service that reads the whole JSON document by row key, deserializes it, and uses it further.

From what I have read, Avro requires a schema up front, and I am not sure how to come up with an Avro schema for my JSON document.

Below is the JSON document that I write into Cassandra after serializing it with Jackson. How do I come up with an Avro schema for it?

{
  "lv" : [ {
    "v" : {
      "site-id" : 0,
      "categories" : {
        "321" : {
          "price_score" : "0.2",
          "confidence_score" : "0.5"
        },
        "123" : {
          "price_score" : "0.4",
          "confidence_score" : "0.2"
        }
      },
      "price-score" : 0.5,
      "confidence-score" : 0.2
    }
  } ],
  "lmd" : 1379231624261
}

Can anyone provide a simple example of how to come up with an Avro schema based on the JSON document above? Thanks for the help.


Solution

The simplest way to define an Avro schema for the document you have outlined is to start from Avro IDL. IDL is a higher-level language than the JSON form of an Avro schema, and it makes writing schemas much more straightforward.

See the Avro IDL documentation here: http://avro.apache.org/docs/current/idl.html

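To describe your JSON, define a set of records in IDL that look like the ones below. Two things differ from your document: Avro names may not contain hyphens, so "site-id", "price-score" and "confidence-score" become site_id, price_score and confidence_score, and the intermediate "v" wrapper is dropped, so each element of lv is a vObject directly.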

@namespace("com.sample")
protocol sample {
   record Category {
      union {null, string} price_score = null;
      union {null, string} confidence_score = null;
   }
   record vObject {
      int site_id = 0;
      union {null, map<Category>} categories = null;
      union {null, float} price_score = null;
      union {null, float} confidence_score = null;
   }

   record SampleObject {
      union {null, array<vObject>} lv = null;
      long lmd = -1;
   }
}
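
Because the field names and nesting in the IDL differ slightly from the original JSON, existing documents need a small reshaping step before they line up with SampleObject. A minimal sketch in plain Python follows; the function name to_sample_object is hypothetical, and it assumes every element of lv has a "v" object shaped like the one in the question:

```python
import json

# The question's document, abbreviated to a single category entry.
RAW = json.loads("""
{
  "lv": [ { "v": {
    "site-id": 0,
    "categories": { "321": { "price_score": "0.2", "confidence_score": "0.5" } },
    "price-score": 0.5,
    "confidence-score": 0.2
  } } ],
  "lmd": 1379231624261
}
""")


def to_sample_object(doc):
    """Reshape the original JSON so its keys match the SampleObject record."""
    v_objects = []
    for entry in doc.get("lv") or []:
        v = entry["v"]  # the IDL has no "v" wrapper, so unwrap it here
        v_objects.append({
            "site_id": v["site-id"],                 # hyphen -> underscore
            "categories": v.get("categories"),       # map keys may stay as-is
            "price_score": v.get("price-score"),
            "confidence_score": v.get("confidence-score"),
        })
    return {"lv": v_objects, "lmd": doc["lmd"]}


sample = to_sample_object(RAW)
print(json.dumps(sample, indent=2))
```

The category IDs ("321", "123") need no renaming because they are map keys, and Avro map keys are arbitrary strings.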

When you run the IDL compiler described on that page (the idl command of avro-tools), you will get a generated Avro protocol file whose "types" section holds your schemas, like so:

{
  "protocol" : "sample",
  "namespace" : "com.sample",
  "types" : [ {
    "type" : "record",
    "name" : "Category",
    "fields" : [ {
      "name" : "price_score",
      "type" : [ "null", "string" ],
      "default" : null
    }, {
      "name" : "confidence_score",
      "type" : [ "null", "string" ],
      "default" : null
    } ]
  }, {
    "type" : "record",
    "name" : "vObject",
    "fields" : [ {
      "name" : "site_id",
      "type" : "int",
      "default" : 0
    }, {
      "name" : "categories",
      "type" : [ "null", {
        "type" : "map",
        "values" : "Category"
      } ],
      "default" : null
    }, {
      "name" : "price_score",
      "type" : [ "null", "float" ],
      "default" : null
    }, {
      "name" : "confidence_score",
      "type" : [ "null", "float" ],
      "default" : null
    } ]
  }, {
    "type" : "record",
    "name" : "SampleObject",
    "fields" : [ {
      "name" : "lv",
      "type" : [ "null", {
        "type" : "array",
        "items" : "vObject"
      } ],
      "default" : null
    }, {
      "name" : "lmd",
      "type" : "long",
      "default" : -1
    } ]
  } ],
  "messages" : {
  }
}
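
The generated file is a protocol, while many tools expect a single standalone schema. One way to get that, sketched here in plain Python under the assumption that the protocol only uses records, arrays, maps and unions as above, is to inline each named type at its first use. The helper inline_schema is hypothetical and not part of any Avro library:

```python
import json

# The generated protocol from above, written compactly.
PROTOCOL = json.loads("""
{"protocol": "sample", "namespace": "com.sample", "types": [
  {"type": "record", "name": "Category", "fields": [
    {"name": "price_score", "type": ["null", "string"], "default": null},
    {"name": "confidence_score", "type": ["null", "string"], "default": null}]},
  {"type": "record", "name": "vObject", "fields": [
    {"name": "site_id", "type": "int", "default": 0},
    {"name": "categories", "type": ["null", {"type": "map", "values": "Category"}], "default": null},
    {"name": "price_score", "type": ["null", "float"], "default": null},
    {"name": "confidence_score", "type": ["null", "float"], "default": null}]},
  {"type": "record", "name": "SampleObject", "fields": [
    {"name": "lv", "type": ["null", {"type": "array", "items": "vObject"}], "default": null},
    {"name": "lmd", "type": "long", "default": -1}]}
], "messages": {}}
""")


def inline_schema(protocol, root_name):
    """Return a standalone schema for root_name with named types inlined once."""
    by_name = {t["name"]: t for t in protocol["types"]}
    seen = set()

    def resolve(schema):
        if isinstance(schema, str):
            # Inline a named type the first time it appears; later uses
            # (and primitives such as "int") stay as plain names.
            if schema in by_name and schema not in seen:
                seen.add(schema)
                return resolve(by_name[schema])
            return schema
        if isinstance(schema, list):  # union: resolve every branch
            return [resolve(branch) for branch in schema]
        out = dict(schema)
        if schema.get("type") == "record":
            out["fields"] = [dict(f, type=resolve(f["type"])) for f in schema["fields"]]
        elif schema.get("type") == "array":
            out["items"] = resolve(schema["items"])
        elif schema.get("type") == "map":
            out["values"] = resolve(schema["values"])
        return out

    root = resolve(root_name)
    # Nested records inherit the namespace of the enclosing schema.
    root["namespace"] = protocol["namespace"]
    return root


schema = inline_schema(PROTOCOL, "SampleObject")
print(json.dumps(schema, indent=2))
```

The printed result could be saved as a schema file for SampleObject, with vObject and Category defined inline where they are first referenced.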

In whichever language you like, you can now generate a set of objects from this schema, and their default "toString" operation outputs JSON much like what you have above. However, the real benefit of Avro is its compact binary encoding: because the schema already describes the field names and types, they are not repeated in every record. Write your data in the Avro binary format to see that benefit.
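
To give a feel for why the binary form is smaller, here is a hand-rolled sketch of two encoding rules from the Avro specification: a long is written as a zig-zag variable-length integer, and a string as a long length followed by its UTF-8 bytes. This is only an illustration with hypothetical helper names (encode_long, encode_string); real code should use an Avro library's writer rather than this:

```python
def encode_long(n):
    """Encode an int the way Avro encodes int/long: zig-zag, then varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as a long byte-length followed by UTF-8 data."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


lmd = 1379231624261
binary_lmd = encode_long(lmd)
json_lmd = '"lmd" : ' + str(lmd)  # what the JSON text spends on this field

print(len(binary_lmd), "bytes in Avro binary")
print(len(json_lmd), "characters in JSON")
print(encode_string("0.2"))
```

The field name "lmd" does not appear in the binary output at all; a reader recovers it from the schema, which is where most of the size saving comes from.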

Hope this helps!

Licensed under: CC-BY-SA with attribution