Experience with Hadoop?

https://stackoverflow.com/questions/17721

09-06-2019
|

Question

Have any of you tried Hadoop? Can it be used without the distributed filesystem that goes with it, in a Share-nothing architecture? Would that make sense?

I'm also interested into any performance results you have...

Solution

Yes, you can use Hadoop on a local filesystem by using file URIs instead of hdfs URIs in various places. I think a lot of the examples that come with Hadoop do this.

This is probably fine if you just want to learn how Hadoop works and the basic map-reduce paradigm, but you will need multiple machines and a distributed filesystem to get the real benefits of the scalability inherent in the architecture.

OTHER TIPS

Hadoop MapReduce can run ontop of any number of file systems or even more abstract data sources such as databases. In fact there are a couple of built-in classes for non-HDFS filesystem support, such as S3 and FTP. You could easily build your own input format as well by extending the basic InputFormat class.

Using HDFS brings certain advantages, however. The most potent advantage is that the MapReduce job scheduler will attempt to execute maps and reduces on the physical machines that are storing the records in need of processing. This brings a performance boost as data can be loaded straight from the local disk instead of transferred over the network, which depending on the connection may be orders of magnitude slower.

As Joe said, you can indeed use Hadoop without HDFS. However, throughput depends on the cluster's ability to do computation near where data is stored. Using HDFS has 2 main benefits IMHO 1) computation is spread more evenly across the cluster (reducing the amount of inter-node communication) and 2) the cluster as a whole is more resistant to failure due to data unavailability.

If your data is already partitioned or trivially partitionable, you may want to look into supplying your own partitioning function for your map-reduce task.

The best way to wrap your head around Hadoop is to download it and start exploring the include examples. Use a Linux box/VM and your setup will be much easier than Mac or Windows. Once you feel comfortable with the samples and concepts, then start to see how your problem space might map into the framework.

A couple resources you might find useful for more info on Hadoop:

Hadoop Summit Videos and Presentations

Hadoop: The Definitive Guide: Rough Cuts Version - This is one of the few (only?) books available on Hadoop at this point. I'd say it's worth the price of the electronic download option even at this point ( the book is ~40% complete ).

Hadoop: The Definitive Guide: Rough Cuts Version

Parallel/ Distributed computing = SPEED << Hadoop makes this really really easy and cheap since you can just use a bunch of commodity machines!!!

Over the years disk storage capacities have increased massively but the speeds at which you read the data have not kept up. The more data you have on one disk, the slower the seeks.

Hadoop is a clever variant of the divide an conquer approach to problem solving. You essentially break the problem into smaller chunks and assign the chunks to several different computers to perform processing in parallel to speed things up rather than overloading one machine. Each machine processes its own subset of data and the result is combined in the end. Hadoop on a single node isn't going to give you the speed that matters.

To see the benefit of hadoop, you should have a cluster with at least 4 - 8 commodity machines (depending on the size of your data) on a the same rack.

You no longer need to be a super genius parallel systems engineer to take advantage of distributed computing. Just know hadoop with Hive and your good to go.

yes, hadoop can be very well used without HDFS. HDFS is just a default storage for Hadoop. You can replace HDFS with any other storage like databases. HadoopDB is an augmentation over hadoop that uses Databases instead of HDFS as a data source. Google it, you will get it easily.

If you're just getting your feet wet, start out by downloading CDH4 & running it. You can easily install into a local Virtual Machine and run in "pseudo-distributed mode" which closely mimics how it would run in a real cluster.

Yes You can Use local file system using file:// while specifying the input file etc and this would work also with small data sets.But the actual power of hadoop is based on distributed and sharing mechanism. But Hadoop is used for processing huge amount of data.That amount of data cannot be processed by a single local machine or even if it does it will take lot of time to finish the job.Since your input file is on a shared location(HDFS) multiple mappers can read it simultaneously and reduces the time to finish the job. In nutshell You can use it with local file system but to meet the business requirement you should use it with shared file system.

Great theoretical answers above.

To change your hadoop file system to local, you can change it in "core-site.xml" configuration file like below for hadoop versions 2.x.x.

 <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>

for hadoop versions 1.x.x.

 <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow