Question

I often see job descriptions for data scientists that ask for Python/Java experience and disregard R. Below is a personal email I received from the chief data scientist of a company I applied to through LinkedIn.

X, Thanks for connecting and expressing interest. You do have good Analytics Skills. However, all our data scientists must have good programming skills in Java/Python as we are a internet/mobile organisation and everything we do is online.

While I respect the decision of the chief data scientist, I am unable to get a clear picture of which tasks Python can do that R cannot. Would anyone care to elaborate? I am actually keen to learn Python/Java, provided I get a bit more detail.

Edit: I found an interesting discussion on Quora. Why is Python a language of choice for data scientists?

Edit2: Blog from Udacity on Languages and Libraries for Machine Learning

Solution

So you can integrate with the rest of the code base. It seems your company uses a mix of Java and Python. What are you going to do if a little corner of the site needs machine learning: pass the data around with a database or a cache, drop to R, and so on? Why not just do it all in the same language? It's faster, cleaner, and easier to maintain.

Know any online companies that run solely on R? Neither do I...

All that said, Java is the last language I'd do data science in.
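
As a rough illustration of the integration argument in this answer, here is a minimal sketch in which the model lives in the same process and language as the code that serves requests. Everything in it is hypothetical: fit_line, handle_request, and the toy data are made up for the example, and a real site would use a proper library and real data.

```python
# Hypothetical sketch: a tiny model called in-process by application code,
# with no hand-off to a separate R process, database, or cache.


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy "training" data; in practice this would come from the site's own storage.
PAGE_VIEWS = [1, 2, 3, 4, 5]
PURCHASES = [2, 4, 6, 8, 10]
SLOPE, INTERCEPT = fit_line(PAGE_VIEWS, PURCHASES)


def handle_request(page_views):
    """Stand-in for a web handler: the prediction is a plain function call."""
    predicted = SLOPE * page_views + INTERCEPT
    return {"page_views": page_views, "predicted_purchases": predicted}


if __name__ == "__main__":
    print(handle_request(6))
```

The point is only that the prediction is an ordinary function call inside the application, so there is nothing extra to deploy or keep in sync.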

OTHER TIPS

There may be a lot of reasons like:

  1. Workforce flexibility: Java / Python programmers can be moved to other tasks or projects easily.

  2. Candidate availability: there are plenty of Java / Python programmers. You do not want to introduce a new programming language only to find out later that there are no qualified workers or that they are just too expensive.

  3. Integration and ETL: Sometimes getting the data with the right quality is the hardest part of the project. So it is natural to use the same language as the rest of the systems.

  4. Business model definition: Most business rules and business models are already written in these languages.

  5. Just keeping things simple. It is already hard enough to stay up to date with the technologies. A diverse base of languages can be chaotic: R for this, Ruby for that, Scala, Clojure, F#, Swift, Dart... They may need different servers and different patches, which is hell to administer. All have their own IDEs with tools and plugins (not always free). See some of Uncle Bob's points about language choice and new technologies.

So even if you have a 5% - 15% productivity advantage using R for the specific task, they may prefer a tool that just does the job even if not in the most efficient way.

It is in general true that for purely data science and statistics exercises R offers the best and fastest (especially if using the data.table package) tools and methods, that otherwise would be heavier to implement in Python (I assume by Python we all mean Pandas, though). Most data scientists do in fact use R to perform their models and calculations, or just to see how data behave.

Once the exercise is complete, it is time to make it available to the people who have to use it (i.e. to deploy). To this end it is often preferred to submit the code in Python, for two main reasons:

  1. Most architectures are written in Python or are Python-friendly, therefore it would be easier to implement models natively written in that language.
  2. R syntax and grammar are extremely complicated. I myself strongly favour R over anything else, but I have to admit that the syntax is not really straightforward and has a very steep learning curve.

That said, it is still true that one can easily translate R code into any other language, provided the methods, libraries and packages are available (in Python most of them are, so that is no problem at all). Plenty of infrastructures and databases support underlying R code, hence portability is not really a problem, especially if one just has to submit the results of the calculations (to that extent, nobody really sees the underlying code anyway).

Java is of almost no use for pure data science itself (although Stanford University has a collection of machine learning NLP libraries written in Java, as far as I remember - but please check). The only reason it may be required is that the rest of the company uses it to a large extent and they do not want to replace it with something new.

I've seen quite a few companies using the title Data Scientist for "Data Engineer" type roles, particularly in the big data space.

If the company is using Hadoop or a distributed framework like Spark to do its analytics, then Java or Python (or probably Scala) would be the languages that make the most sense.

Java

I'd have to disagree with the other posters on the Java question. There are certain big data systems (like Hadoop) for which one needs to write MapReduce jobs in Java. Now you can use Hive to achieve much the same result.
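
For readers who have not met the term, here is a minimal sketch of the shape of a MapReduce job (map, group by key, reduce), written in plain Python purely for illustration. It is not the Hadoop API, and the function names map_phase, shuffle, reduce_phase, and word_count are my own; a real Hadoop job would express the same mapper and reducer in Java and let the framework do the grouping across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["R or Python", "Python or Java"]))
```

Hive lets you state the same kind of aggregation as a SQL-like GROUP BY query instead of writing the mapper and reducer by hand.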

Python

The Python / R debate continues. Both are extensible languages, so potentially both could have the same ability to process data. I only know R, and my Python knowledge is quite superficial. Speaking as a small business owner, you do not want too many tools in your business, otherwise there will be a general lack of depth in them and difficulty supporting them. I think it will come down to the depth of tool knowledge in the team. If the team is focused on Python, then hiring another Python data scientist is going to make sense, as they can engage with the existing code base and historic experiment code.

At least for my current team (~80 data scientists and engineers), we don't have such preference. Half of the data scientists here use R and another half use Python. Many can code in both. We do deploy Python and R code in production.

I don't think any of our data scientists uses Java at all. If they need to deal with big data, they can use SparkSQL or PySpark. The data engineering team uses a mix of Java/Scala/Python/Go.

If you are one of the few data people in a small company, I can understand why they require certain language skills so you can do both data science and engineering. But tbh, I think most small companies won't have data big enough that Python or R can't handle it in production.

My point of view as a general purpose programmer with a tiny bit of R experience: R is excellent for data science, but it's geared towards people manually interpreting data. If you want to use the results for something automated, you have to interface with something else, and that something else will be hard to do in a problem-specific language like R. Can you do a web site in R? :) On the other hand, Python does have ready-made libraries for data sciency stuff and is a general purpose programming language that doesn't get in the way of your doing anything else with it. As for Java, it's good for large programming projects with hundreds of thousands to millions of lines of code. If the data science part needs to interface with that, it may make sense to do everything in Java.

Random whine: Why do I have to sign in to each StackExchange site separately?

The tools in Python are just better than those in R. The R community is pretty stagnant while the Python community is evolving really quickly, especially in tools for data science.
Also, Python works way more easily with everything around it. You can easily scrape the web, connect to databases and so on. That makes prototyping really fast.
And if you have a working prototype and care to make it faster or integrate it into the company workflow, it usually gets reimplemented in Java.

R has a few neat tools and visualizations, but it is not that great for building new stuff.
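
To make the "connect to databases" point above concrete, here is a small sketch using only Python's standard library. The visits table, its columns, and the average_time_per_user helper are invented for the example; an in-memory SQLite database stands in for whatever database a company actually runs.

```python
import sqlite3
from statistics import mean

# In-memory database so the sketch needs no setup; table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, seconds_on_site REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, 30.0), (1, 50.0), (2, 10.0), (2, 20.0), (2, 30.0)],
)


def average_time_per_user(connection):
    """Pull rows out of the database and summarise them in plain Python."""
    rows = connection.execute(
        "SELECT user_id, seconds_on_site FROM visits ORDER BY user_id"
    ).fetchall()
    by_user = {}
    for user_id, seconds in rows:
        by_user.setdefault(user_id, []).append(seconds)
    return {user_id: mean(values) for user_id, values in by_user.items()}


if __name__ == "__main__":
    print(average_time_per_user(conn))
```

Swapping SQLite for another database would mostly mean changing the connection line, which is the kind of quick prototyping the answer has in mind.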

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange