Question

I am currently writing a package to streamline data analysis for a research lab. We use several different analysis software packages, based in Unix, MATLAB, and (rarely) Python. A typical data set is about 250 GB raw and requires at least 4 different preprocessing steps before analysis; the finished product typically takes up about 1 TB. The goal of my package is to let the user pick and choose which existing package to use for each step, and then run the whole analysis without further user intervention. Since the goal is to integrate these different packages, written in different languages, I decided to write the program in bash to make it easy to call the actual analysis scripts no matter what language they are written in.
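Roughly, the pick-and-choose idea looks like this (a minimal sketch; the step names, tool choices, and script paths are placeholders, and `matlab -batch` requires R2019a or later):

    #!/usr/bin/env bash
    # Sketch only: map each preprocessing step to the tool the user chose.
    # Requires bash 4+ for associative arrays; names and paths are invented.
    declare -A tool_for_step=(
        [denoise]="matlab"
        [align]="python"
        [segment]="unix"
    )

    run_step() {
        local step="$1" input="$2" output="$3"
        case "${tool_for_step[$step]}" in
            matlab) matlab -batch "run_${step}('$input', '$output')" ;;
            python) python3 "steps/${step}.py" "$input" "$output" ;;
            unix)   "steps/${step}.sh" "$input" "$output" ;;
        esac
    }

    run_step denoise data/raw data/work/denoised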

The program is starting to come along, but it is getting very complex because of the idiosyncratic expectations and conventions of each analysis package. I realize bash may not be the most suitable language for complex tasks, but I like that it makes calling scripts written in other languages easy, and that it's relatively simple. The program also does a lot of file handling, which bash is good at. On the other hand, I hear it's also very slow, and that it gets clunky as things get more complicated.

I'm wondering if bash is the best choice for this task. Does anyone have suggestions for other languages, or combinations of languages, that might be better suited to my needs?

I should note that I am a self-taught programmer and this is my first real programming challenge. I am mostly familiar with bash, MATLAB, R, and a little bit of Python, but I'd like to learn new things too (C maybe?). Also, this is all going to run on Unix.

Solution

If you are mostly stitching together calls to other software, like Unix utilities (awk, grep, sed, ...), Python, and MATLAB scripts, bash is just fine, and possibly even the best tool for the job, for constructing simple pipelines and workflows.
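For example, a whole multi-step run can read as a short, linear script (a sketch only; the script names and paths here are hypothetical):

    #!/usr/bin/env bash
    set -euo pipefail   # abort the whole run if any step fails

    raw="data/raw"
    work="data/work"
    mkdir -p "$work"

    # Each step is just a call into whatever tool handles it best.
    ./steps/convert.sh        "$raw"            "$work/converted"
    python3 steps/filter.py   "$work/converted" "$work/filtered"
    matlab -batch "align_step('$work/filtered', '$work/aligned')"
    ./steps/analyze.sh        "$work/aligned"   results/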

It's easy in bash to read user input, store it in variables, and then launch other software depending on how those variables are set. It's perfectly fast enough for that, and no other language makes it any easier.
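Something like this, assuming a hypothetical choice between a MATLAB and a Python aligner:

    # Ask the user which tool to use, then dispatch on the answer.
    read -rp "Which aligner should be used? [matlab/python] " aligner

    case "$aligner" in
        matlab) matlab -batch "align_images('in/', 'out/')" ;;
        python) python3 align_images.py in/ out/ ;;
        *)      echo "unknown aligner: $aligner" >&2; exit 1 ;;
    esac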

If, however, you were to use bash for the preprocessing itself (looping through files line by line, packing and unpacking tab-separated values into arrays, and so on), that would be excruciatingly slow and is not recommended.
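To illustrate the difference, a sketch (file names are hypothetical): the pure-bash loop below does the reading and splitting itself and can be orders of magnitude slower than handing the same per-line work to awk in a single call.

    # Slow: bash reads and splits every line of the file itself.
    while IFS=$'\t' read -r -a fields; do
        echo "${fields[0]}"
    done < huge_table.tsv > ids.txt

    # Fast: delegate the per-line work to awk, which is built for it.
    awk -F'\t' '{ print $1 }' huge_table.tsv > ids.txt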
