Question

I'm just getting started with Hadoop and I'm struggling to figure out how to use input sources other than files, for example reading all the rows from AWS SimpleDB, or all the records from a REST API on another system. Everything I find online only shows how to process files or a few specific databases.

The InputFormat API looks quite complex, so I'm trying to figure out the quickest way to read in data from an arbitrary non-file data source, which can then be processed with MapReduce using Amazon's Elastic MapReduce (which is based on Hadoop). I'm writing the code in Java.

Thanks!


Solution

The quickest way would be to use a data aggregation tool such as Flume or Chukwa. You can find a very good example of collecting Twitter data through Flume using the Twitter API here. It shows how you can use Flume to read Twitter data into your Hadoop cluster and then process it with Hive. You could write your own MapReduce job for that step if you need to. Devising a custom InputFormat for these kinds of sources really does require some work, and I don't think you'll find much help on it (unless somebody has already done this and is willing to share it with you).
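If you do go the custom InputFormat route, the core idea is simple even though the Hadoop API around it is verbose: `getSplits()` partitions the source into slices, and each mapper's `RecordReader` fetches only its own slice (e.g. a key range in SimpleDB or a page range of a REST API). Below is a minimal standalone sketch of that partitioning logic in plain Java — the `Split` class and `getSplits` method are hypothetical stand-ins for Hadoop's `InputSplit` and `InputFormat.getSplits()`, just to illustrate the shape, not actual Hadoop code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a custom InputFormat would divide a non-file source
// into per-mapper slices. Names here are illustrative, not Hadoop's.
public class SplitSketch {

    // Stand-in for Hadoop's InputSplit: one slice of the remote dataset,
    // described by a starting record offset and a record count.
    static class Split {
        final long start;
        final long length;
        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Analogue of InputFormat.getSplits(): divide totalRecords into
    // roughly equal chunks, one per mapper. Each mapper's RecordReader
    // would then query only [start, start + length) from the source.
    static List<Split> getSplits(long totalRecords, int numMappers) {
        List<Split> splits = new ArrayList<>();
        long chunk = (totalRecords + numMappers - 1) / numMappers; // ceil
        for (long start = 0; start < totalRecords; start += chunk) {
            splits.add(new Split(start, Math.min(chunk, totalRecords - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 records across 3 mappers -> slices 0+4, 4+4, 8+2
        for (Split s : getSplits(10, 3)) {
            System.out.println(s.start + "+" + s.length);
        }
    }
}
```

The same pattern carries over to the real API: your `InputSplit` subclass carries whatever addressing the source needs (a SimpleDB `select` key range, a REST page number), and the `RecordReader` turns each fetched record into a key/value pair for the mapper.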

HTH

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow