Question

I'm an R language programmer. I'm also in the group of people who are considered Data Scientists but who come from academic disciplines other than CS.

This works out well in my role as a Data Scientist. However, by starting my career in R and having only basic knowledge of other scripting/web languages, I've felt somewhat inadequate in two key areas:

  1. Lack of a solid knowledge of programming theory.
  2. Lack of a competitive level of skill in faster, more widely used languages like C, C++ and Java, which could be used to speed up pipeline and Big Data computations, as well as to create DS/data products that can be more readily developed into fast back-end scripts or standalone applications.

The solution is simple of course -- go learn about programming, which is what I've been doing by enrolling in some classes (currently C programming).

However, now that I'm starting to address problems #1 and #2 above, I'm left asking myself "Just how viable are languages like C and C++ for Data Science?".

For instance, I can move data around very quickly and interact with users just fine, but what about advanced regression, Machine Learning, text mining and other more advanced statistical operations?

So, can C do the job -- what tools are available for advanced statistics, ML, AI, and other areas of Data Science? Or must I lose most of the efficiency gained by programming in C by calling on R scripts or other languages?

The best resource I've found thus far in C is a library called Shark, which gives C/C++ the ability to use Support Vector Machines, linear regression (though not non-linear or other advanced regression like multinomial probit, etc.) and a shortlist of other (great, but limited) statistical functions.


Solution

Or must I lose most of the efficiency gained by programming in C by calling on R scripts or other languages?

Do the opposite: learn C/C++ to write R extensions. Use C/C++ only for the performance-critical sections of your new algorithms; use R to build your analysis, import data, make plots, etc.
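A minimal sketch of that workflow, assuming the Rcpp package is installed (the function name and example below are illustrative, not from the answer):

```cpp
// colsums.cpp -- compile and load from R with: Rcpp::sourceCpp("colsums.cpp")
#include <Rcpp.h>
using namespace Rcpp;

// A tight double loop written in C++; the [[Rcpp::export]] attribute
// makes the function callable from R as rcpp_colsums(m).
// [[Rcpp::export]]
NumericVector rcpp_colsums(NumericMatrix m) {
  int nrow = m.nrow(), ncol = m.ncol();
  NumericVector out(ncol);
  for (int j = 0; j < ncol; ++j) {
    double s = 0.0;
    for (int i = 0; i < nrow; ++i) s += m(i, j);
    out[j] = s;
  }
  return out;
}
```

After sourcing it, rcpp_colsums() behaves like any other R function, so the surrounding analysis, plotting and data import stay in R while only the hot loop lives in C++.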

If you want to go beyond R, I'd recommend learning Python. There are many libraries available, such as scikit-learn for machine learning algorithms or PyBrain for building neural networks, etc. (and use pylab/matplotlib for plotting and IPython notebooks to develop your analyses). Again, C/C++ is useful for implementing time-critical algorithms as Python extensions.
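As a hedged illustration of that last point, here is a sketch of a Python extension written in C++ using the pybind11 header-only library (the module and function names are made up for the example):

```cpp
// fastops.cpp -- a small C++ extension exposed to Python via pybind11.
// One common build command:
//   c++ -O3 -shared -std=c++14 -fPIC $(python3 -m pybind11 --includes) \
//       fastops.cpp -o fastops$(python3-config --extension-suffix)
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Sum of squares over a 1-D NumPy array, computed in C++ for speed.
double sum_of_squares(py::array_t<double> x) {
  auto v = x.unchecked<1>();              // fast 1-D view, no bounds checks
  double total = 0.0;
  for (py::ssize_t i = 0; i < v.shape(0); ++i)
    total += v(i) * v(i);
  return total;
}

PYBIND11_MODULE(fastops, m) {
  m.def("sum_of_squares", &sum_of_squares, "Sum of squares computed in C++");
}
```

From Python this would then be used as `import fastops; fastops.sum_of_squares(arr)`, keeping the hot loop in C++ while the rest of the analysis stays in Python.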

OTHER TIPS

As Andre Holzner has said, extending R with C/C++ extensions is a very good way to take advantage of the best of both sides. You can also try the inverse: working in C++ and occasionally calling R functions with the RInside R package. Here you can find how (a minimal sketch follows the links below):

http://cran.r-project.org/web/packages/RInside/index.html http://dirk.eddelbuettel.com/code/rinside.html
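The following is a hedged RInside sketch, assuming RInside and Rcpp are installed (the toy data is mine; the RInside package ships example Makefiles that show the required compiler and linker flags):

```cpp
// rinside_mean.cpp -- embed an R interpreter inside a C++ program.
#include <RInside.h>
#include <iostream>
#include <vector>

int main(int argc, char *argv[]) {
  RInside R(argc, argv);                     // start an embedded R session
  std::vector<double> x = {1.5, 2.5, 4.0};
  R["x"] = x;                                // copy the data into R as 'x'
  // Evaluate an R expression and pull the result back into C++.
  double m = Rcpp::as<double>(R.parseEval("mean(x)"));
  std::cout << "mean computed by R: " << m << std::endl;
  return 0;
}
```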

Once you're working in C++ you have many libraries at your disposal, many of them built for specific problems, others more general (see the small mlpack sketch after the links below):

http://www.shogun-toolbox.org/page/features/ http://image.diku.dk/shark/sphinx_pages/build/html/index.html

http://mlpack.org/
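For a flavour of what one of these libraries looks like in practice, here is a hedged mlpack sketch (header paths and namespaces shift between mlpack versions; this follows the 3.x layout, and the toy data is made up):

```cpp
// mlpack_ols.cpp -- fit an ordinary least-squares model with mlpack.
#include <mlpack/methods/linear_regression/linear_regression.hpp>
#include <armadillo>

int main() {
  // mlpack stores data column-major: each column is one observation.
  arma::mat X(3, 100, arma::fill::randu);            // 3 predictors, 100 samples
  arma::rowvec y = 2.0 * X.row(0) - X.row(1) + 0.5;  // synthetic responses

  mlpack::regression::LinearRegression lr(X, y);     // fits OLS with intercept

  arma::mat Xnew(3, 5, arma::fill::randu);
  arma::rowvec yhat;
  lr.Predict(Xnew, yhat);                            // predictions for new points
  yhat.print("predictions:");
  return 0;
}
```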

I agree that the current trend is to use Python/R and to bind it to some C/C++ extensions for computationally expensive tasks.

However, if you want to stay in C/C++, you might want to have a look at Dlib:

Dlib is a general purpose cross-platform C++ library designed using contract programming and modern C++ techniques. It is open source software and licensed under the Boost Software License.

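To give a feel for Dlib's API, here is a hedged sketch of kernel ridge regression on toy 1-D data, modelled loosely on Dlib's own regression examples (details not checked against any particular Dlib version):

```cpp
// dlib_krr.cpp -- kernel ridge regression on a toy sin() curve.
#include <dlib/svm.h>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  using sample_type = dlib::matrix<double, 1, 1>;            // 1-D samples
  using kernel_type = dlib::radial_basis_kernel<sample_type>;

  std::vector<sample_type> samples;
  std::vector<double> targets;
  for (double x = 0; x < 10; x += 0.5) {
    sample_type s;
    s(0) = x;
    samples.push_back(s);
    targets.push_back(std::sin(x));                          // noiseless targets
  }

  dlib::krr_trainer<kernel_type> trainer;
  trainer.set_kernel(kernel_type(0.1));                      // RBF gamma = 0.1
  auto f = trainer.train(samples, targets);                  // learned function

  sample_type q;
  q(0) = 2.0;
  std::cout << "prediction at x=2: " << f(q) << std::endl;   // roughly sin(2)
  return 0;
}
```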

In my opinion, ideally, to be a more well-rounded professional, it would be nice to know at least one programming language for each of the most popular programming paradigms (procedural, object-oriented, functional). Certainly, I consider R and Python to be the two most popular programming languages and environments for data science and, therefore, the primary data science tools.

Julia is impressive in certain aspects, but it is still trying to catch up with those two and establish itself as a major data science tool. I don't see this happening any time soon, simply due to R's and Python's popularity, very large communities, and enormous ecosystems of existing and newly developed packages/libraries, covering a very wide range of domains and fields of study.

Having said that, many packages and libraries focused on data science, ML and AI are implemented in and/or provide APIs for languages other than R or Python (as proof, see this curated list and this curated list, both of which are excellent and give a solid perspective on the variety in the field). This is especially true for performance-oriented or specialized software. For that software, I've seen projects with implementations and/or APIs mostly in Java, C and C++ (Java is especially popular in the big data segment of data science, due to its closeness to Hadoop and its ecosystem, and in the NLP segment), but other options are available, albeit to a much more limited, domain-specific extent. None of these languages is a waste of time; however, you have to prioritize mastering any or all of them in line with your current work situation, projects and interests. So, to answer your question about the viability of C/C++ (and Java), I would say that they are all viable, not as primary data science tools, but as secondary ones.

Answering your questions on 1) C as a potential data science tool and 2) its efficiency: 1) while it's possible to use C for data science, I would recommend against it, because you'd have a very hard time finding corresponding libraries or, even more so, trying to implement the corresponding algorithms yourself; 2) you shouldn't worry about efficiency, as many performance-critical segments of code are already implemented in low-level languages like C, and there are options to interface popular data science languages with, say, C (for example, the Rcpp package for integrating R with C/C++: http://dirk.eddelbuettel.com/code/rcpp.html). This is in addition to simpler, but often rather effective, approaches to performance, such as consistent use of vectorization in R, as well as using various parallel programming frameworks, packages and libraries. For examples from the R ecosystem, see the CRAN Task View "High-Performance and Parallel Computing with R".

Speaking about data science, I think that it makes quite a lot of sense to mention the importance of reproducible research approach as well as the availability of various tools, supporting this concept (for more details, please see my relevant answer). I hope that my answer is helpful.

R is one of the key tools for data scientists; whatever you do, don't stop using it.

Now, talking about C, C++ or even Java: they are good, popular languages. Whether you need them, or will need them, depends on the type of job or projects you have. From personal experience, there are so many tools out there for data scientists that you will always feel like you constantly need to be learning.

You can add Python or MATLAB to the list of things to learn if you want, and keep adding. The best way to learn is to take on a work project using tools you are not comfortable with. If I were you, I would learn Python before C. It is more used in the community than C, but learning C is not a waste of your time.

As a data scientist, the other languages (C++/Java) come in handy when you need to incorporate machine learning into an existing production engine.

Waffles is both a well-maintained C++ class library and command-line analysis package. It's got supervised and unsupervised learning, tons of data manipulation tools, sparse data tools, and other things such as audio processing. Since it's also a class library, you can extend it as you need. Even if you are not the one developing the C++ engine (chances are you won't be), this will allow you to prototype, test, and hand something over to the developers.

Most importantly, I believe my knowledge of C++ & Java really help me understand how Python and R work. Any language is only used properly when you understand a little about what is going on underneath. By learning the differences between languages you can learn to exploit the strengths of your main language.

Update

For commercial applications with large data sets, Apache Spark's MLlib is important. Here you can use Scala, Java, or Python.

I would be keen to understand why you would need another language (apart from Python) if your goal is "advanced regression, Machine Learning, text mining and other more advanced statistical operations".
For that kind of thing, C is a waste of time. It's a good tool to have, but in the ~20 years since Java came out, I've rarely coded C.
If you prefer the more functional-programming side of R, learn Scala before you get into too many procedural bad habits coding with C.
Lastly, learn to use Hadley Wickham's libraries - they'll save you a lot of time doing data manipulation.

There are some C++ tools for statistics and data science, like ROOT (https://root.cern.ch/drupal/), BAT (https://www.mppmu.mpg.de/bat/), Boost, or OpenCV.
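As a hedged illustration of the OpenCV end of that list (cv::kmeans lives in OpenCV's core module; the toy data below is made up):

```cpp
// opencv_kmeans.cpp -- cluster random 2-D points into 3 groups with cv::kmeans.
#include <opencv2/core.hpp>
#include <iostream>

int main() {
  // 30 random 2-D points, one row per sample (CV_32F as cv::kmeans expects).
  cv::Mat points(30, 2, CV_32F);
  cv::randu(points, 0.0, 10.0);

  cv::Mat labels, centers;
  cv::kmeans(points, 3, labels,
             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 20, 1e-3),
             5, cv::KMEANS_PP_CENTERS, centers);   // 5 restarts, k-means++ init

  std::cout << "cluster centers:\n" << centers << std::endl;
  return 0;
}
```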

Not sure whether it's been mentioned yet, but there's also Vowpal Wabbit, though it might be specific to certain kinds of problems only.

Take a look at Intel DAAL, which is under active development. It's highly optimised for Intel CPU architectures and supports distributed computations.

Scalable Machine Learning Solutions for Big Data:

I will add my $.02 because there is a key area that seems not to have been addressed in all of the previous posts - machine learning on big data!

For big data, scalability is key, and R is insufficient. Further, languages like Python and R are only useful for interfacing with scalable solutions, which are usually written in other languages. I make this distinction not because I want to disparage those who use them, but only because it's so important for members of the data science community to understand what truly scalable machine learning solutions look like.

I do most of my work with big data on distributed-memory clusters. That is, I don't just use one 16-core machine (4 quad-core processors on a single motherboard, sharing that motherboard's memory); I use a small cluster of 64 16-core machines. The requirements are very different for these distributed-memory clusters than for shared-memory environments, and in many cases big data machine learning requires scalable solutions within distributed-memory environments.

We also use C and C++ everywhere within a proprietary database product. All of our high-level stuff is handled in C++ and MPI, but the low-level stuff that touches the data is all longs and C-style character arrays, to keep the product very, very fast. The convenience of std::string is simply not worth the computational cost.
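A minimal sketch of the distributed-memory pattern described above: each MPI rank computes a partial result over its shard of the data, and MPI_Allreduce combines the partials. This assumes an MPI implementation such as Open MPI (compile with mpic++, run with mpirun); the shard contents are made up.

```cpp
// mpi_partial_sums.cpp -- combine per-rank partial sums across a cluster.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Pretend each rank owns one shard of the data set.
  std::vector<double> shard(1000, rank + 1.0);
  double local = 0.0;
  for (double v : shard) local += v;          // local work, no communication

  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("global sum across %d ranks: %f\n", size, global);

  MPI_Finalize();
  return 0;
}
```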

There are not many C++ libraries available that offer distributed, scalable machine learning capabilities; MLPACK is one of the few.

However, there are other scalable solutions with APIs:

Apache Spark has a scalable machine learning library called MLlib that you can interface with.

Also, TensorFlow now offers distributed TensorFlow and has a C++ API.

Hope this helps!

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange