Question

I am currently working on a project in the Business Intelligence and Big Data area, two fields in which, in all honesty, I am new and very green.

I was planning to build a Hive data warehouse using MongoDB and connect it to a Business Intelligence platform like Pentaho. While researching, I came across Spark and got interested in its Shark module because of its in-memory functionality and the resulting increase in query performance.

I know that I can connect Hive to Pentaho, but what I am wondering is whether I could use Shark queries between them for performance. If not, does anyone know of any other BI platform that would allow that?

As I said, I am pretty new to these areas, so feel free to correct me; there is a good chance that I have some concepts mixed up and have said something idiotic.


Solution

I think you should build either a Hive data warehouse using Hive or a MongoDB data warehouse using MongoDB. I don't understand how you are going to mix them, but I will try to answer the question anyway.

Usually, you configure the BI tool with a JDBC driver for the database of your choice (e.g. Hive), and the BI tool fetches the data through that driver. How the driver fetches the data from the database is completely transparent to the BI tool.

Thus, you can use Hive, Shark, or any other database that comes with a JDBC driver.
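To make the JDBC point concrete, here is a minimal Java sketch of roughly what a BI tool does under the hood. It assumes a HiveServer2-compatible endpoint and the matching JDBC driver JAR on the classpath. The class name, helper names, host, port, and database are placeholders for illustration only, and the exact URL scheme and port depend on the engine and version you end up using.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class BiJdbcSketch {

    // Builds a HiveServer2-style JDBC URL.
    // Host, port and database are placeholders (assumptions), not values from the question.
    static String hiveUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Runs a query and prints the first column of every row.
    // Only the URL (and the driver on the classpath) ties this code to a specific engine.
    static void printFirstColumn(String url, String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }

    public static void main(String[] args) {
        String url = hiveUrl("localhost", 10000, "default");
        System.out.println("Connecting to " + url);
        try {
            printFirstColumn(url, "SHOW TABLES");
        } catch (SQLException e) {
            // Reached when no driver is on the classpath or no server is listening.
            System.out.println("Query failed: " + e.getMessage());
        }
    }
}
```

Switching engines should then mostly come down to changing the URL and the driver JAR, which is the same kind of change you would make in the BI tool's connection settings.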

I can summarize your options this way:

Hive: the most complete feature set and the most compatible tool. It can be used over plain data, or you can ETL the data into the ORC format to boost performance.

Impala: claims to be faster than Hive but has a less complete feature set. It can be used over plain data, or you can ETL the data into the Parquet format to boost performance.

Shark: cutting edge, not mainstream yet. Performance depends on what percentage of your data fits into RAM across your cluster.

OTHER TIPS

First of all, Shark is being absorbed by Spark SQL. Spark SQL provides a JDBC/ODBC connector, which should allow you to integrate it with most of your existing platforms.

Licensed under: CC-BY-SA with attribution