Question

Could anybody please recommend a good solution (framework) for accessing HBase on a Hadoop cluster from a Scala (or Java) application?

So far I'm moving in the Scalding direction. The prototypes I built let me combine the Scalding library with Maven and separate the Scalding job JAR from the 'library' code packages. This in turn lets me run Scalding-based Hadoop jobs from outside the cluster with minimal overhead per job (the 'library' code is pushed to the cluster's distributed cache only when it changes, which is rarely, so job code loads fast).
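To illustrate the kind of job I mean, here is a minimal fields-based Scalding sketch (the class name, field names and the --input/--output arguments are purely illustrative):

    import com.twitter.scalding._

    // Minimal fields-based Scalding job; names and paths are illustrative only.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

Packaged into a job JAR, a job like this can be launched from outside the cluster with something along the lines of 'hadoop jar my-jobs.jar com.twitter.scalding.Tool WordCountJob --hdfs --input ... --output ...', which is the low-overhead submission model described above.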

Now I'm actually starting to play with HBase itself, and I see that Scalding is good but not so 'native' to HBase. Yes, there are things like hbase-scalding, but since I'm at a point where I need to plan future work anyway, I'd like to know about other good solutions I may have missed.
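For reference, 'native' access here would mean going straight through the stock HBase client API. A minimal single-row read from Scala (using the classic HTable/Get API of that generation; the table, row key and column names are invented for illustration) looks roughly like this:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Get, HTable}
    import org.apache.hadoop.hbase.util.Bytes

    // Rough sketch only: 'my_table', the row key and the column names are placeholders.
    object HBaseReadSketch extends App {
      // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, "my_table")
      try {
        val result = table.get(new Get(Bytes.toBytes("row-key-1")))
        val value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qualifier"))
        println(Option(value).map(Bytes.toString).getOrElse("<no value>"))
      } finally {
        table.close()
      }
    }

Frameworks like the ones I'm asking about essentially wrap this client (and the corresponding MapReduce input/output formats) in something more composable for batch jobs.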

What is expected:

  • Application (job) startup overhead should be low; I need to run a lot of them.
  • It should be possible (the easier, the better) to run jobs from outside the cluster without any SSH (just via the 'hadoop jar' command, or even simply by executing the application).
  • The job language itself should allow short, logical semantics. Ideally the code should be simple enough to be generated automatically.
  • The solution should perform well on reasonably large HBase tables (initially up to 100,000,000 entries).
  • The solution should be 'live' (actively developed) but reasonably stable overall.

I think the reasoning behind a recommendation could be even more useful than the solution itself, and this question should give a couple of ideas to many people. Any piece of advice?

Solution 2

HPaste http://www.gravity.com/labs/hpaste/ may be what you are looking for.

OTHER TIPS

If you're using Scalding (which I recommend), there's a new project with updated Cascading and Scalding wrappers for accessing HBase. You might want to check it out: https://github.com/ParallelAI/SpyGlass
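For a flavour of what that looks like, below is a rough sketch adapted from memory of the SpyGlass examples; the table name, ZooKeeper quorum and column names are placeholders, and the exact HBaseSource constructor arguments differ between SpyGlass versions, so check the project README rather than relying on this:

    import com.twitter.scalding._
    import cascading.tuple.Fields
    import parallelai.spyglass.hbase.HBaseSource
    import parallelai.spyglass.hbase.HBaseConstants.SourceMode

    // Rough, unverified sketch: table, quorum and column names are placeholders,
    // and the HBaseSource constructor may not match your SpyGlass version exactly.
    class HBaseScanJob(args: Args) extends Job(args) {
      new HBaseSource(
          "my_table",                       // HBase table name (placeholder)
          "zookeeper-host:2181",            // ZooKeeper quorum (placeholder)
          new Fields("key"),                // row key field
          List("cf"),                       // column families (placeholder)
          List(new Fields("col1")),         // columns to read (placeholder)
          sourceMode = SourceMode.SCAN_ALL) // full-table scan
        .read
        .write(Tsv(args("output")))
    }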

You may be interested in the Kiji project (https://github.com/kijiproject/). It provides a "schema-ed" layer on top of HBase.

It also has a Scalding adapter (KijiExpress), so you can perform functional collection operations (map, groupBy, etc.) on "pipes" of tuples sourced from these schema-ed HBase tables.

Update (August 2014): Stratosphere is now called Apache Flink (incubating)

Check out Stratosphere. It offers a Scala API, has an HBase module, and is under active development.

  • Starting a job should be possible within a second or so (depending on your cluster size).
  • You can submit jobs remotely (it has a class called RemoteExecutor that lets you programmatically submit jobs to remote clusters); see the sketch below.
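As a sketch of what remote submission looks like with the renamed project (the Apache Flink Scala batch API), something along these lines should work; the JobManager host, port, JAR path and HDFS paths are placeholders:

    import org.apache.flink.api.scala._

    // Sketch of submitting a batch job to a remote cluster from outside it.
    // Host, port, JAR path and HDFS paths are placeholders.
    object RemoteWordCount {
      def main(args: Array[String]): Unit = {
        // Connects to the remote JobManager and ships the assembled job JAR with the program.
        val env = ExecutionEnvironment.createRemoteEnvironment(
          "jobmanager-host", 6123, "/path/to/assembled-job.jar")

        env.readTextFile("hdfs:///data/input")
          .flatMap(_.toLowerCase.split("""\W+""").filter(_.nonEmpty))
          .map((_, 1))
          .groupBy(0)
          .sum(1)
          .writeAsCsv("hdfs:///data/wordcount-output")

        env.execute("remote word count sketch")
      }
    }

The HBase module mentioned above should then be able to supply the input in place of readTextFile.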

Please contact me if you have further questions!

I am currently trying to maintain hbase-scalding in my free time, as I am also picking up Scala.

Please take a look at it on GitHub.
