Question

An aspiring data scientist here. I don't know anything about Hadoop, but as I have been reading about Data Science and Big Data, I see a lot of talk about Hadoop. Is it absolutely necessary to learn Hadoop to be a Data Scientist?


Solution

Different people use different tools for different things. Terms like Data Science are generic for a reason. A data scientist could spend an entire career without having to learn a particular tool like Hadoop. Hadoop is widely used, but it is not the only platform that is capable of managing and manipulating data, even large-scale data.

I would say that a data scientist should be familiar with concepts like MapReduce, distributed systems, distributed file systems, and the like, but I wouldn't judge someone for not knowing about such things.
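To make the MapReduce concept concrete, here is a minimal single-machine sketch in plain Python (no Hadoop involved); the word-count task and the function names are purely illustrative.

    from collections import defaultdict

    # Map phase: turn each input record (a line of text) into (key, value) pairs.
    def map_phase(lines):
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    # Shuffle phase: group values by key, as the framework would do between nodes.
    def shuffle_phase(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    # Reduce phase: collapse each key's values into a single result.
    def reduce_phase(grouped):
        return {key: sum(values) for key, values in grouped.items()}

    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle_phase(map_phase(lines))))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}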

It's a big field. There is a sea of knowledge and most people are capable of learning and being an expert in a single drop. The key to being a scientist is having the desire to learn and the motivation to know that which you don't already know.

As an example: I could hand the right person a hundred structured CSV files containing information about classroom performance in one particular class over a decade. A data scientist would be able to spend a year gleaning insights from the data without ever needing to spread computation across multiple machines. You could apply machine learning algorithms, analyze it using visualizations, combine it with external data about the region, ethnic makeup, changes to the environment over time, political information, weather patterns, etc. All of that would be "data science" in my opinion. It might take something like Hadoop to test and apply anything you learned to data comprising an entire country of students rather than just a classroom, but that final step doesn't necessarily make someone a data scientist. And not taking that final step doesn't necessarily disqualify someone from being a data scientist.
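As a rough illustration of that single-machine workflow, here is a sketch using pandas; the file names and column names (year, score, the weather file) are hypothetical, not part of the original example.

    import glob
    import pandas as pd

    # Load and combine the structured CSV files (hypothetical schema: year, student_id, score).
    frames = [pd.read_csv(path) for path in glob.glob("classroom_*.csv")]
    scores = pd.concat(frames, ignore_index=True)

    # Basic cleaning and per-year summary statistics, all on one machine.
    scores = scores.dropna(subset=["score"])
    yearly = scores.groupby("year")["score"].agg(["mean", "std", "count"]).reset_index()

    # Combine with external data about the region (hypothetical weather-by-year file).
    weather = pd.read_csv("region_weather.csv")
    enriched = yearly.merge(weather, on="year", how="left")
    print(enriched.head())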

OTHER TIPS

As a former Hadoop engineer, I'd say it is not needed, but it helps. Hadoop is just one system - the most common one, based on Java, with an ecosystem of products - that applies a particular technique, MapReduce, to obtain results in a timely manner. Hadoop is not used at Google, though I assure you they use big data analytics. Google uses their own systems, developed in C++. In fact, Hadoop was created as a result of Google publishing their MapReduce and BigTable (HBase in Hadoop) white papers.

Data scientists will interface with Hadoop engineers, though at smaller places you may be required to wear both hats. If you are strictly a data scientist, then whatever you use for your analytics (R, Excel, Tableau, etc.) will operate only on a small subset, and that work will then need to be converted to run against the full data set on Hadoop.
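A hedged sketch of that handoff, assuming PySpark is available on the cluster; the file paths and column names are hypothetical. The idea is to prototype the aggregation on a local sample, then express the same logic against the full data set.

    import pandas as pd

    # Prototype on a small local sample (hypothetical file and columns).
    sample = pd.read_csv("events_sample.csv")
    by_user = sample.groupby("user_id")["amount"].sum()

    # The same logic rewritten to run on the full data set via Spark on the cluster.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("full-run").getOrCreate()
    events = spark.read.csv("hdfs:///data/events/*.csv", header=True, inferSchema=True)
    by_user_full = events.groupBy("user_id").agg(F.sum("amount").alias("total"))
    by_user_full.write.parquet("hdfs:///results/by_user")  # hypothetical output path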

You first have to make clear what you mean by "learning Hadoop". If you mean using Hadoop, such as learning to program in MapReduce, then it is most probably a good idea. But fundamental knowledge (databases, machine learning, statistics) may play a bigger role as time goes on.

Yes, you should learn a platform that is capable of dissecting your problem into a data-parallel problem. Hadoop is one. For simple needs (design patterns like counting, aggregation, filtering, etc.) Hadoop is enough; for more complex machine learning, such as Bayesian methods or SVMs, you need Mahout, which in turn runs on Hadoop (now Apache Spark) to solve your problem with a data-parallel approach.

So Hadoop is a good platform to learn and really important for your batch processing needs. Not only Hadoop, but you also need to know Spark (Mahout runs its algorithms utilizing Spark) and Twitter Storm (for your real-time analytics needs). This list will continue and evolve, so if you are good with the building blocks (distributed computing, data-parallel problems, and so on) and know how one such platform (say Hadoop) operates, you will fairly quickly be up to speed on the others.
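To show what "dissecting your problem into a data-parallel problem" means without any cluster, here is a minimal sketch using Python's multiprocessing; the log records and the error-counting task are made up, but the pattern (split the data, process chunks independently, combine partial results) is the same one Hadoop and Spark apply across machines.

    from multiprocessing import Pool

    def count_errors(chunk):
        # Each worker counts matching records in its own chunk, independently.
        return sum(1 for record in chunk if record.startswith("ERROR"))

    if __name__ == "__main__":
        records = ["ERROR disk full", "INFO ok", "ERROR timeout", "INFO ok"] * 1000
        chunks = [records[i::4] for i in range(4)]  # split the data four ways

        with Pool(processes=4) as pool:
            partial = pool.map(count_errors, chunks)  # process chunks in parallel

        print(sum(partial))  # combine the partial results: 2000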

It strongly depends on the environment/company you are working with. In my eyes there is a lot of "big data" hype at the moment, and many companies try to enter the field with Hadoop-based solutions - which makes Hadoop a buzzword, but it's not always the best solution.

In my mind, a good Data Scientist should be able to ask the right questions and keep asking until it's clear what's really needed. Then a good Data Scientist - of course - needs to know how to address the problem (or at least know someone who can). Otherwise your stakeholder could be frustrated :-)

So, I would say it's not absolutely necessary to learn Hadoop.

You should learn Hadoop if you want to work as a data scientist, but maybe before starting with Hadoop you should read something about ETL or Big Data... this book could be a good starting point: http://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343

Hope it helps and good luck!

You can apply data science techniques to data on one machine, so the answer to the question, as the OP phrased it, is no.

Data Science is a field demanding a variety of skills. Having knowledge of Hadoop is one of them. The main tasks of a Data Scientist include:

  1. Gathering data from different resources.
  2. Cleaning and pre-processing the data.
  3. Studying statistical properties of the data.
  4. Using Machine Learning techniques to do forecasting and derive insights from the data.
  5. Communicating the results to decision makers in an easy to understand way.

Out of the above points, knowledge of Hadoop is useful for points 1, 2, and 3, but you also need a strong mathematical/statistical background and strong knowledge of computational techniques to work in the data science field. Also, Hadoop is not the only framework used in Data Science; the Big Data ecosystem has a range of frameworks, each specific to a particular use case. This article gives introductory material on the major Big Data frameworks that could be used in Data Science (a minimal single-machine sketch of points 2-4 follows the link below):

http://www.codophile.com/big-data-frameworks-every-programmer-should-know/
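As a rough, non-authoritative sketch of points 2-4 on a single machine, assuming pandas and scikit-learn; the file name and column names (feature_a, feature_b, target) are hypothetical.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # 2. Cleaning and pre-processing (hypothetical columns: feature_a, feature_b, target).
    df = pd.read_csv("data.csv").dropna()

    # 3. Studying statistical properties of the data.
    print(df.describe())
    print(df.corr())

    # 4. Forecasting with a simple machine learning model.
    X_train, X_test, y_train, y_test = train_test_split(
        df[["feature_a", "feature_b"]], df["target"], test_size=0.2, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))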

I do think learning the Hadoop framework (the hard way) is not a requirement for being a Data Scientist. General knowledge of the big data platforms is essential. I would suggest knowing the concepts; the only part you really need from Hadoop is MapReduce: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
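If you do want to touch MapReduce directly without writing Java, one low-ceremony route is Hadoop Streaming, where the mapper and reducer are plain scripts that read stdin and write stdout. A minimal word-count sketch; in practice the two functions would live in separate mapper.py and reducer.py scripts passed to the streaming jar, and all names here are illustrative.

    import sys

    # mapper.py: emit "word<TAB>1" for every word read from stdin.
    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word.lower()}\t1")

    # reducer.py: input arrives sorted by key, so counts for a word are contiguous.
    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    # Typical invocation (the jar path varies by installation):
    # hadoop jar hadoop-streaming.jar -input in/ -output out/ \
    #   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py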

A Data Scientist does not build clusters or administer them; they just make "magic" with data and do not care where it is coming from. The term "Hadoop" has come to refer not just to the base modules, but also to the "ecosystem", the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.

Most important are the programming language, math, and statistics for working with data (you'll need to find a way to connect to the data and move forward). I wish I had had somebody to point me to the concepts rather than spending weeks learning the framework and building nodes and clusters from scratch, because that part is an Administrator role, not Data Engineer or Data Scientist. Also, one thing: everything is changing and evolving, but math, programming, and statistics are still the requirements.

Accessing data from HDFS is essential, for example via PROC HADOOP, Hive, SparkContext, or any other driver or pipe (treat Hadoop as a point of accessing data, or as storage :)

Tools and frameworks are already in place that take care of resource allocation, management, and performance.
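For instance, a minimal PySpark sketch of treating Hadoop purely as a storage layer you read from; the paths and the table name are hypothetical, and the Hive query only works if the cluster exposes such a table.

    from pyspark.sql import SparkSession

    # The cluster (YARN, etc.) handles resource allocation; the script only declares the work.
    spark = SparkSession.builder.appName("hdfs-access").enableHiveSupport().getOrCreate()

    # Read a file directly from HDFS (hypothetical path) ...
    logs = spark.read.text("hdfs:///data/logs/2024/*.log")
    print(logs.count())

    # ... or query a Hive table (hypothetical table), letting the cluster do the heavy lifting.
    sales = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    sales.show()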

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange