You should definitely check out Camus, LinkedIn's Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. Check out this post I have written for a simple example which fetches from the Twitter stream and writes to HDFS, partitioned by tweet timestamps.
The project is available on GitHub at https://github.com/linkedin/camus
Camus has two main components: one for reading and decoding data from Kafka, and one for writing data to HDFS –
Decoding Messages read from Kafka
Camus has a set of decoders which help in decoding messages coming from Kafka. Each decoder extends com.linkedin.camus.coders.MessageDecoder and implements the logic to partition data based on a timestamp. A set of predefined decoders is present in the directory below, and you can write your own based on these:
camus/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/
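To make the decoder's job concrete, here is a minimal sketch of the timestamp-extraction step a custom decoder performs for tweet data. In a real decoder you would extend com.linkedin.camus.coders.MessageDecoder and return the parsed timestamp wrapped with the payload; the Camus classes are omitted here so the sketch stands alone, and the naive string-based field lookup and the `TweetTimestampExtractor` name are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Locale;

class TweetTimestampExtractor {

    // Twitter's created_at format, e.g. "Wed Oct 10 20:19:24 +0000 2018"
    private static final String TWITTER_FORMAT = "EEE MMM dd HH:mm:ss Z yyyy";

    /**
     * Pulls the created_at value out of a raw JSON payload and parses it
     * to epoch millis -- the value a decoder would hand to Camus so the
     * message lands in the HDFS partition matching its tweet timestamp.
     */
    static long extractTimestamp(byte[] payload) {
        String json = new String(payload, StandardCharsets.UTF_8);
        // Naive field lookup keeps the sketch dependency-free; a real
        // decoder would use a proper JSON parser instead.
        String key = "\"created_at\":\"";
        int start = json.indexOf(key) + key.length();
        int end = json.indexOf('"', start);
        String raw = json.substring(start, end);
        try {
            return new SimpleDateFormat(TWITTER_FORMAT, Locale.ENGLISH)
                    .parse(raw).getTime();
        } catch (java.text.ParseException e) {
            throw new IllegalArgumentException("unparseable created_at: " + raw, e);
        }
    }
}
```

The key design point is that partitioning is driven entirely by this extracted timestamp rather than by the time the job runs, which is what lets Camus place late-arriving messages into the correct historical partition.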
Writing messages to HDFS
Camus needs a set of RecordWriterProvider classes which extend com.linkedin.camus.etl.RecordWriterProvider and tell Camus what payload should be written to HDFS. A set of predefined RecordWriterProvider classes is present in the directory below, and you can write your own based on these:
camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common
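As an illustration of the decision a RecordWriterProvider encodes, here is a sketch of the byte-level formatting only: a real provider extends com.linkedin.camus.etl.RecordWriterProvider and returns a Hadoop RecordWriter, but those dependencies are omitted so the example runs on its own. The newline-delimited layout mirrors what the string-based providers shipped with Camus produce, and the `NewlineDelimitedFormatter` name is an assumption made for this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class NewlineDelimitedFormatter {

    private static final byte[] DELIMITER = "\n".getBytes(StandardCharsets.UTF_8);

    /**
     * Returns the exact bytes that would be handed to the HDFS writer for
     * one record: the raw payload followed by a newline delimiter, so that
     * downstream jobs can read the file as plain line-oriented text.
     */
    static byte[] format(String record) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            out.write(record.getBytes(StandardCharsets.UTF_8));
            out.write(DELIMITER);
            return out.toByteArray();
        } catch (IOException e) {
            // ByteArrayOutputStream never actually throws on write
            throw new AssertionError(e);
        }
    }
}
```

A custom provider would swap this formatting for whatever your consumers need, e.g. Avro or SequenceFile output, while Camus keeps handling the partition paths and commit logic.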