Question

It seems as though most languages have some number of scientific computing libraries available.

  • Python has SciPy
  • Rust has SciRust
  • C++ has several, including ViennaCL and Armadillo
  • Java has Java Numerics and Colt, as well as several others

Not to mention languages like R and Julia designed explicitly for scientific computing.

With so many options, how do you choose the best language for a task? And which languages will be the most performant? Python and R seem to have the most traction in the space, but logically a compiled language seems like it would be a better choice. And will anything ever outperform Fortran? Compiled languages also tend to have GPU acceleration, while interpreted languages like R and Python don't. What should I take into account when choosing a language, and which languages provide the best balance of utility and performance? Also, are there any languages with significant scientific computing resources that I've missed?


Solution

This is a pretty massive question, so this is not intended to be a full answer, but hopefully this can help to inform general practice around determining the best tool for the job when it comes to data science. Generally, I have a relatively short list of qualifications I look for when it comes to any tool in this space. In no particular order they are:

  • Performance: Basically boils down to how quickly the language does matrix multiplication, as that is more or less the most important task in data science (a crude timing sketch follows this list).
  • Scalability: At least for me personally, this comes down to ease of building a distributed system. This is somewhere where languages like Julia really shine.
  • Community: With any language, you're really looking for an active community that can help you when you get stuck using whichever tool you're using. This is where Python pulls very far ahead of most other languages.
  • Flexibility: Nothing is worse than being limited by the language that you use. It doesn't happen very often, but trying to represent graph structures in Haskell is a notorious pain, and Julia still has plenty of code-architecture pain points as a result of being such a young language.
  • Ease of Use: If you want to use something in a larger environment, you want to make sure that setup is straightforward and can be automated. Nothing is worse than having to set up a finicky build on half a dozen machines.

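As a rough way of checking that first point, here is a minimal timing sketch in Python with NumPy; the matrix size and repetition count are arbitrary choices for illustration, and the same few lines are easy to reproduce in R, Julia, or C++ against their respective BLAS backends for an apples-to-apples comparison.

    import time
    import numpy as np

    # Crude proxy for numerical performance: time a dense matrix multiply.
    n = 2000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    start = time.perf_counter()
    for _ in range(5):
        c = a @ b
    elapsed = (time.perf_counter() - start) / 5

    print(f"{n}x{n} matmul: {elapsed:.3f} s per multiply")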
There are a ton of articles out there about performance and scalability, but in general you're going to be looking at a performance differential of maybe 5-10x between languages, which may or may not matter depending on your specific application. As far as GPU acceleration goes, cudamat is a really seamless way of getting it working with Python, and the CUDA libraries in general have made GPU acceleration far more accessible than it used to be.
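For instance, a minimal cudamat sketch might look like the following; this assumes cudamat and a working CUDA toolkit are installed, and the matrix sizes are arbitrary:

    import numpy as np
    import cudamat as cm

    cm.cublas_init()  # initialize the CUDA/cuBLAS context

    # Push two random matrices to the GPU and multiply them there.
    a = cm.CUDAMatrix(np.random.rand(1000, 1000))
    b = cm.CUDAMatrix(np.random.rand(1000, 1000))
    c = cm.dot(a, b)

    result = c.asarray()  # copy the product back into a NumPy array
    print(result.shape)

    cm.cublas_shutdown()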

The two primary metrics I use for both community and flexibility are the state of the language's package manager and the language's questions on a site like Stack Overflow. If there are a large number of high-quality questions and answers, it's a good sign that the community is active. The number of packages and the general activity on those packages can also be a good proxy for this metric.
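As a rough illustration of the Stack Overflow side of that, the sketch below pulls per-tag question counts from the Stack Exchange API; treat the tag names and the resulting counts as illustrative rather than a rigorous popularity measure.

    import requests

    # Compare how many Stack Overflow questions are tagged with each language.
    tags = ["python", "r", "julia", "fortran"]
    url = "https://api.stackexchange.com/2.3/tags/{}/info".format(";".join(tags))
    resp = requests.get(url, params={"site": "stackoverflow"})
    resp.raise_for_status()

    for item in resp.json()["items"]:
        print("{:>10}: {:,} questions".format(item["name"], item["count"]))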

As far as ease of use goes, I am a firm believer that the only way to actually know is to set it up yourself. There's a lot of superstition around many data science tools, specifically things like databases and distributed computing architectures, but there's no way to really know whether something is easy or hard to set up and deploy without just building it yourself.

OTHER TIPS

The best language depends on what you want to do. First remark: don't limit yourself to one language. Learning a new language is always a good thing, but at some point you will need to choose. Facilities offered by the language itself are an obvious thing to take into account, but in my opinion the following are more important:

  • available libraries: do you have to implement everything from scratch, or can you reuse existing stuff? Note that these libraries need not be in whatever language you are considering, as long as you can interface with them easily. Working in a language without library access won't help you get things done.
  • number of experts: if you want external developers or to start working in a team, you have to consider how many people actually know the language. As an extreme example: if you decide to work in Brainfuck because you happen to like it, know that you will likely work alone. Many surveys exist that can help assess the popularity of languages, including the number of questions per language on SO.
  • toolchain: do you have access to good debuggers, profilers, documentation tools and (if you're into that) IDEs?

I am aware that most of my points favor established languages. This is from a 'get-things-done' perspective.

That said, I personally believe it is far better to become proficient in a low level language and a high level language:

  • low level: C++, C, Fortran, ... in which you implement only the hot spots that profiling identifies, since development in these languages is typically slower (though this is subject to debate). These languages remain king of the hill in terms of raw performance and are likely to stay on top for a long time.
  • high level: Python, R, Clojure, ... to 'glue' stuff together and do the non-performance-critical work (preprocessing, data handling, ...). I find this important simply because it is much easier to do rapid development and prototyping in these languages (a minimal sketch of this split follows the list).
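To make that split concrete, here is a minimal sketch of the pattern, with the hot spot written in C and driven from Python via ctypes; the file names (hotspot.c, libhotspot.so), the compile command, and the dot-product kernel are purely illustrative assumptions.

    # hotspot.c -- the performance-critical kernel, kept deliberately tiny:
    #
    #     double dot(const double *a, const double *b, long n) {
    #         double s = 0.0;
    #         for (long i = 0; i < n; i++) s += a[i] * b[i];
    #         return s;
    #     }
    #
    # compiled with something like: cc -O3 -shared -fPIC hotspot.c -o libhotspot.so

    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./libhotspot.so")
    lib.dot.restype = ctypes.c_double
    lib.dot.argtypes = [
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),
        ctypes.c_long,
    ]

    # The "glue" stays in Python: data loading, preprocessing, reporting, ...
    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    print(lib.dot(a, b, a.size))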

First you need to decide what you want to do, then look for the right tool for that task.

A very general approach is to use R for first versions, to see whether your approach is correct. It lacks a little in speed, but it has very powerful commands and add-on libraries, so you can try almost anything with it: http://www.r-project.org/

The second idea: if you want to understand the algorithms behind the libraries, you might want to take a look at Numerical Recipes. The books are available for different languages and are free to use for learning; if you want to use the code in commercial products, you need to purchase a license: http://en.wikipedia.org/wiki/Numerical_Recipes

Most of the time performance will not be the issue; finding the right algorithms and their parameters will be. So it is important to have a fast scripting language rather than a monster program that first needs to compile for 10 minutes before calculating two numbers and putting out the result.

And a big plus in using R is that it has built-in functions or libraries for almost any kind of diagram you might need to visualize your data.

Once you have a working version, it is usually straightforward to port it to any other language you think is more performant.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange