Question

I want to understand whether Netezza or Hadoop is the right choice for the following purposes:

  • Pull feed files of considerable size, at times more than a GB, from several online sources.

  • Clean, filter, transform and compute further information from the feeds.

  • Generate metrics across different dimensions, akin to what data warehouse cubes provide, and

  • Allow web apps to access the final data/metrics quickly using SQL or other standard mechanisms.


Solution

How it works:
As data is loaded into the Appliance, it intelligently distributes each table across the 108 SPUs (Snippet Processing Units).
Typically, the hard disk is the slowest part of a computer. Imagine 108 of these spinning up at once, each loading a small piece of the table: this is how Netezza achieves a load rate of 500 gigabytes per hour (roughly 500/108 ≈ 4.6 GB per disk per hour, well within a single disk's capability).
After a piece of the table is loaded and stored on each SPU (a computer on an integrated circuit card), each column is analyzed to gather descriptive statistics such as minimum and maximum values. These values are stored on each of the 108 SPUs in place of indexes, which take time to create and update and take up unnecessary space.
Imagine your environment without the need to create indexes. When it is time to query the data, a master computer inside the Appliance asks the SPUs which of them contain the required data.
Only the SPUs that contain the relevant data return information, so less data moves across the network to the Business Intelligence/Analytics server. For joining data, it gets even better.
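To make the min/max idea concrete, here is a rough Python sketch of how such statistics let a query skip whole data slices. The table contents, slice layout and predicate are made up for illustration; this is not Netezza's internal implementation, just the general "zone map" principle.

```python
# Minimal sketch of min/max "zone map" pruning (illustrative only, not Netezza internals).
from dataclasses import dataclass

@dataclass
class Slice:
    rows: list          # (order_id, amount) tuples held by one SPU/slice
    min_amount: float   # descriptive statistics kept instead of an index
    max_amount: float

def make_slice(rows):
    amounts = [amount for _, amount in rows]
    return Slice(rows, min(amounts), max(amounts))

# Pretend the table was split across a few SPUs at load time.
slices = [
    make_slice([(1, 10.0), (2, 35.5), (3, 80.0)]),
    make_slice([(4, 120.0), (5, 150.0)]),
    make_slice([(6, 300.0), (7, 410.0)]),
]

def query_amount_greater_than(threshold):
    """Only scan slices whose max value can possibly satisfy the predicate."""
    result = []
    for s in slices:
        if s.max_amount < threshold:      # the whole slice can be skipped
            continue
        result.extend(row for row in s.rows if row[1] > threshold)
    return result

print(query_amount_greater_than(200))     # only the third slice is actually scanned
```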
The Appliance distributes the data of multiple tables across multiple SPUs by a key, so each SPU contains partial data for multiple tables. It joins the parts of each table locally on each SPU, returning only the local result. All of the 'local results' are assembled inside the cabinet and then returned to the Business Intelligence/Analytics server as the query result. This methodology also contributes to the speed story.
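Here is a similarly rough sketch of the co-located join: because both tables are distributed on the same key, each "SPU" joins only the rows it holds, and the cabinet just assembles the partial results. The table contents and the number of SPUs are invented for illustration.

```python
# Sketch of a co-located join: distribute both tables by the same key,
# join locally per partition, then concatenate the partial results.
NUM_SPUS = 4

def distribute(rows, key_index):
    """Hash-distribute rows across SPUs by the join key."""
    partitions = [[] for _ in range(NUM_SPUS)]
    for row in rows:
        partitions[hash(row[key_index]) % NUM_SPUS].append(row)
    return partitions

customers = [(1, "Acme"), (2, "Globex"), (3, "Initech")]        # (cust_id, name)
orders    = [(101, 1, 50.0), (102, 3, 75.0), (103, 1, 20.0)]     # (order_id, cust_id, amount)

cust_parts  = distribute(customers, key_index=0)
order_parts = distribute(orders, key_index=1)

def local_join(cust_part, order_part):
    """Each SPU joins only the rows it holds; no data crosses partitions."""
    by_id = {c[0]: c[1] for c in cust_part}
    return [(o[0], by_id[o[1]], o[2]) for o in order_part if o[1] in by_id]

# The "cabinet" simply assembles the local results into the final answer.
result = [row for spu in range(NUM_SPUS)
          for row in local_join(cust_parts[spu], order_parts[spu])]
print(result)
```

Because both tables are hashed on the customer key, matching rows always land on the same partition, which is why no data has to move between SPUs during the join.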
The key to all of this is 'less movement of data across the network'. The Appliance only returns the required data back to the Business Intelligence/Analytics server across the organization's 1000/100 Mbit network.
This is very different from traditional processing, where the Business Intelligence/Analytics software typically extracts most of the data from the database and does its processing on its own server. Here, the database does the work of determining which data is needed and returns only a smaller result subset to the Business Intelligence/Analytics server.
Backup And Redundancy
To understand how the data and the system are set up for almost 100% uptime, it is important to understand the internal design. The Appliance uses the outer, fastest one-third of each 400-gigabyte disk for data storage and retrieval. One-third of the disk stores descriptive statistics, and the other third stores a hot backup of data from other SPUs. Each Appliance cabinet also contains 4 additional SPUs for automatic failover of any of the 108 SPUs.
Taken from http://www2.sas.com

OTHER TIPS

I would consider designing the batch ETL process and the subsequent SQL access separately. I think the following numbers are important for evaluating the decision:

a) How much raw data do you want to process daily?
b) How much raw data do you want to store in the system?
c) What will be the size of the RDBMS dataset?
d) What kind of SQL queries are you going to have? Here I mean: are they ad-hoc queries or well-planned reports? Another question: do you need joins between two large tables?

With the above questions answered, it will be possible to give better answers. For example, I would consider Netezza as an option when you do need joins of very large tables, and Hadoop if you need to store terabytes of data.

It would seem from your answers that Netezza may be better suited to your needs. It handles ad-hoc queries very well, and the newest version of its software has built-in support for rollups and cubes. Also, Netezza operates on the scale of terabytes of data, so you should be more than able to process the data you have available.
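To illustrate what rollup/cube-style metrics mean (the "metrics on different dimensions" from the question), here is a small Python sketch that aggregates a measure over every combination of dimensions, which is the kind of result GROUP BY CUBE produces in a warehouse. The fact rows and dimension names are made up, and you should check the appliance's SQL reference for the exact rollup/cube syntax it supports.

```python
# Illustration of cube-style aggregation: total a metric over every combination
# of dimensions (hypothetical fact rows, not tied to any particular schema).
from itertools import combinations
from collections import defaultdict

facts = [
    ("EU", "widget", 100.0),   # (region, product, revenue)
    ("EU", "gadget", 250.0),
    ("US", "widget", 300.0),
    ("US", "gadget", 150.0),
]

dimensions = ("region", "product")

def cube(rows):
    """Aggregate revenue over every subset of the dimensions."""
    results = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), r):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] += row[2]
            results[tuple(dimensions[i] for i in dims)] = dict(totals)
    return results

for group_by, totals in cube(facts).items():
    print(group_by or ("<grand total>",), totals)
```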

If you are dealing with an ELT scenario where you have to load huge volumes of files, process them later (filter, transform) and load them into a traditional database for analytics, then you can use Hadoop to load the files and Netezza as the target staging or data warehouse area. With Hadoop you can put all your files into HDFS and then read them with an ETL tool to transform, filter, etc., or use Hive SQL to query the data in those files. However, the Hadoop-based data warehouse Hive does not support updates and does not support all SQL statements. Hence, it is better to read those files from HDFS, apply filters and transformations, and load the result into a traditional data warehouse appliance such as Netezza to write your queries for cubes.
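As one concrete (and purely illustrative) version of that flow, here is a PySpark sketch that reads raw feed files from HDFS, filters and transforms them, and loads the result into a Netezza staging table over JDBC. PySpark is just one possible ETL tool here; all paths, column names, table names, credentials and connection details are placeholders, and it assumes the Netezza JDBC driver is available on the Spark classpath.

```python
# Hypothetical ELT sketch: read raw feed files from HDFS, filter/transform,
# then load the cleaned result into a Netezza staging table over JDBC.
# (Paths, columns, credentials and connection details are placeholders.)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feed-elt").getOrCreate()

# Landing area on HDFS.
raw = (spark.read
       .option("header", "true")
       .csv("hdfs:///landing/feeds/*.csv"))

# Filter bad records, fix types, deduplicate.
clean = (raw
         .filter(F.col("status") == "OK")
         .withColumn("amount", F.col("amount").cast("double"))
         .dropDuplicates(["feed_id"]))

# Staging area on the appliance.
(clean.write
 .format("jdbc")
 .option("url", "jdbc:netezza://nz-host:5480/STAGEDB")
 .option("driver", "org.netezza.Driver")
 .option("dbtable", "STAGING.FEED_CLEAN")
 .option("user", "etl_user")
 .option("password", "********")
 .mode("append")
 .save())
```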

If you are loading gigabytes of data daily into Netezza with landing, staging and mart areas, then most likely you will end up using a lot of space. In this scenario you can put your landing area on Hadoop and keep your staging and mart areas on Netezza. If your queries are simple and you are not doing very complex filtering or updates to the source, you may be able to manage everything with Hadoop.

To conclude, Hadoop is ideal for huge volumes of data but does not support all the functionality of a traditional data warehouse.

You can check out this link to see the differences: http://dwbitechguru.blogspot.ca/2014/12/how-to-select-between-hadoop-vs-netezza.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow