Question

I'm looking at using Avro on Hadoop, but I am concerned about the serialization of large data structures and about how to add methods to the (data) classes.

The example (taken from http://blog.voidsearch.com/bigdata/apache-avro-in-practice/) shows a model of Facebook users.

{
  "namespace": "test.avro",
  "name": "FacebookUser",
  "type": "record",
  "fields": [
      {"name": "name", "type": "string"},
      ...,
      {"name": "friends", "type": {"type": "array", "items": "FacebookUser"}}
  ]
}

Does Avro serialize the complete social graph of a FacebookUser in this model?

[That is, if I want to serialize one user, does the serialization include all its friends, their friends, and so on?]

If the answer is yes, I'd rather store the IDs of friends instead of references, and look them up in my application whenever needed. In that case I would like to be able to add a method that returns the actual friends instead of their IDs.

How can I wrap/extend generated Avro Java classes to add methods?

(Also to add methods that return, for example, the friend count.)


Solution

Regarding the second question: How can I wrap/extend generated Avro Java classes to add methods?

You can use AspectJ to inject new methods into an existing/generated class. AspectJ is required only at compile time. The approach is illustrated below.

Define a Person record as Avro IDL (person.avdl):

@namespace("net.tzolov.avro.extend")
protocol PersonProtocol {
    record Person {
        string firstName;
        string lastName;
    }     
}

Use Maven and the avro-maven-plugin to generate Java sources from the AVDL:

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.6.3</version>
</dependency>
    ......
    <plugin>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro-maven-plugin</artifactId>
        <version>1.6.3</version>
        <executions>
            <execution>
                <id>generate-avro-sources</id>
                <phase>generate-sources</phase>
                <goals>
                    <goal>idl-protocol</goal>
                </goals>
                <configuration>
                    <sourceDirectory>src/main/resources/avro</sourceDirectory>
                    <outputDirectory>${project.build.directory}/generated-sources/java</outputDirectory>
                </configuration>
            </execution>
        </executions>
    </plugin>

The above configuration presumes that the person.avdl file is in src/main/resources/avro. Sources are generated in target/generated-sources/java.

The generated Person.java has two getters: getFirstName() and getLastName(). If you want to extend it with another method, getCompleteName(), which returns firstName and lastName separated by a space, you can inject this method with the following aspect:

package net.tzolov.avro.extend;

import net.tzolov.avro.extend.Person;

public aspect PersonAspect {

    public String Person.getCompleteName() {        
        return this.getFirstName() + " " + this.getLastName();
    }
}

Use the aspectj-maven-plugin to weave this aspect into the generated code:

<dependency>
    <groupId>org.aspectj</groupId>
    <artifactId>aspectjrt</artifactId>
    <version>1.6.12</version>
</dependency>
<dependency>
    <groupId>org.aspectj</groupId>
    <artifactId>aspectjweaver</artifactId>
    <version>1.6.12</version>
</dependency>
    ....
<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>aspectj-maven-plugin</artifactId>
    <version>1.2</version>
    <dependencies>
        <dependency>
            <groupId>org.aspectj</groupId>
            <artifactId>aspectjrt</artifactId>
            <version>1.6.12</version>
        </dependency>
        <dependency>
            <groupId>org.aspectj</groupId>
            <artifactId>aspectjtools</artifactId>
            <version>1.6.12</version>
        </dependency>
    </dependencies>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
                <goal>test-compile</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <source>6</source>
        <target>6</target>
    </configuration>
</plugin>

and the result:

@Test
public void testPersonCompleteName() throws Exception {

    Person person = Person.newBuilder()
            .setFirstName("John").setLastName("Atanasoff").build();

    Assert.assertEquals("John Atanasoff", person.getCompleteName());
}

OTHER TIPS

Let me try to answer the first question first:
To the best of my understanding, Avro is not built to store anything that is not hierarchical. It also has no notion of object IDs. It can store arrays and records of primitive types, or any combination of them. The ability to traverse an object graph that you refer to is a capability of Java Serialization, which Avro lacks.
So to store a graph, you should introduce your own object IDs and explicitly assign them to some fields. You can take a look at the getSchema method here: http://www.java2s.com/Open-Source/Java/Database-DBMS/hadoop-0.20.1/org/apache/avro/reflect/ReflectData.java.htm. It is fairly simple, and it shows how Avro generates a schema from a Java class.
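To make the "introduce your own object IDs" idea concrete, here is a minimal plain-Java sketch. It does not use Avro at all: UserRecord and UserDirectory are hypothetical names standing in for a record whose friends field holds IDs and for whatever lookup your application provides. Treat it as an illustration under those assumptions, not as how Avro itself works.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record that stores friend IDs instead of nested users.
final class UserRecord {
    final long id;
    final String name;
    final List<Long> friendIds;

    UserRecord(long id, String name, List<Long> friendIds) {
        this.id = id;
        this.name = name;
        this.friendIds = friendIds;
    }
}

// Hypothetical application-side index that resolves IDs back to records.
final class UserDirectory {
    private final Map<Long, UserRecord> byId = new HashMap<>();

    void add(UserRecord user) {
        byId.put(user.id, user);
    }

    // Resolves one level of friends only; IDs that are not in the index are skipped.
    List<UserRecord> friendsOf(UserRecord user) {
        List<UserRecord> result = new ArrayList<>();
        for (Long friendId : user.friendIds) {
            UserRecord friend = byId.get(friendId);
            if (friend != null) {
                result.add(friend);
            }
        }
        return result;
    }
}
```

Because each record carries only IDs, serializing one user never drags in the rest of the graph, and mutual friendships (A lists B, B lists A) are not a problem.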
Regarding the second question: I do not think it is a good idea to modify generated code. I would suggest creating a class with all the methods and data you want to add, and putting the Avro-generated "data" class in it as a member.
At the same time, I think that technically extending generated classes should be OK.
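A rough sketch of that composition approach follows. GeneratedPerson is only a hand-written stand-in so the example compiles on its own; in a real project it would be the class emitted by the Avro code generator, whose getters may return CharSequence rather than String depending on configuration. PersonView is a hypothetical wrapper name.

```java
// Stand-in for the Avro-generated data class (assumption: the real one comes from the code generator).
final class GeneratedPerson {
    private final String firstName;
    private final String lastName;

    GeneratedPerson(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    String getFirstName() { return firstName; }
    String getLastName() { return lastName; }
}

// Hypothetical wrapper: holds the generated object as a member and adds behavior around it.
final class PersonView {
    private final GeneratedPerson data;

    PersonView(GeneratedPerson data) {
        this.data = data;
    }

    // Exposes the underlying data object, e.g. for handing it back to a serializer.
    GeneratedPerson getData() {
        return data;
    }

    // Extra method that lives outside the generated code.
    String getCompleteName() {
        return data.getFirstName() + " " + data.getLastName();
    }
}
```

The generated class stays untouched, so regenerating it from the schema does not overwrite the added methods; the cost is that callers have to unwrap with getData() before serializing.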

Beyond trying to solve these issues with Avro, which may or may not work (my guess is that extending generated classes will not work well no matter how you try), you could consider using plain JSON (unless you have a specific requirement for Avro). Many libraries support arbitrary POJO mappings, and some (like Jackson) also support Object Id based serialization (as of 2.0.0).

Licensed under: CC-BY-SA with attribution