Question

Upfront my question is: Are there any standard/common methods for implementing a software package that maintains and updates a MySQL database?

I'm an undergraduate research assistant and I've been tasked with creating a cron job that updates one of our university's in house bioinformatics databases.

Instead of building on monolithic binary that does every aspect of the work, I've divided the problem into subtasks and written a few python/c++ modules to handle the different tasks, as listed in the pipeline below:

  • Query the remote database for a list of updated files and return the result for the given time interval (monthly updated files / weekly / daily);
    • Module implemented in python. URL of updated file(s) output to stdout
  • Read in relative URL's of updated files and download to local directory
    • Module implemented in python
  • Unzip each archive of new files
    • Implemented as bash script
  • Parse files into CSV format
    • Module implemented in C++
  • Run MySQL query to insert CSV files into database
    • Obviously just a bash script

I'm not sure how to go about combining these modules into one package that can be easily moved to another machine, say if our current servers run out of space and the DB needs to be copied to another file-system (It's already happened once before).

My first thought is to create a bash script that pipes all of these modules together given that they all operate with stdin/stdout anyway, but this seems like an odd way of doing things.

Alternatively, I could write my C++ code as a python extension, package all of these scripts together and just write one python file that does this work for me.

Should I be using a package manager so that my code is easily installed on different machines? Does a simple zip archive of the entire updater with an included makefile suffice?

I'm extremely new to database management, and I don't have a ton of experience with distributing software, but I want to do a good job with this project. Thanks for the help in advance.

Was it helpful?

Solution

Inter-process communication (IPC) is a standard mechanism of composing many disparate programs into a complex application. IPC includes piping the output of one program to the input of another, using sockets (e.g. issuing HTTP requests from one application to another or sending data via TCP streams), using named FIFOs, and other mechanisms. In any event, using a Bash script to combine these disparate elements (or similarly, writing a Python script that accomplishes the same thing with the subprocess module) would be completely reasonable. The only thing that I would point out with this approach is that, since you are reading/writing to/from a database, you really do need to consider security/authentication with this approach (e.g. can anyone who can invoke this application write to the database? How do you verify that the caller has the appropriate access).

Regarding distribution, I would say that the most important thing is to ensure that you can find -- at any given version and prior versions -- a snapshot of all components and their dependencies at the versions that they were at the time of release. You should set up a code repository (e.g. on GitHub or some other service that you trust), and create a release branch at the time of each release that contains a snapshot of all the tools at the time of this release. That way if, God forbid, the one and only machine in which you have installed the tools fails, you will still be able to instantly grab a copy of the code and install it on a new machine (and if something breaks, you will be able to go back to an earlier release and binary search until you find out where the breakage happened)

In terms of installation, it really depends on how many steps are involved. If it is as simple as unzipping a folder and ensuring that the folder is in your PATH environment variable, then it is probably not worth the hassle to create any special distribution mechanism (unless you are able to do so easily). What I would recommend, though, is clearly documenting the installation steps in the INSTALL or README documentation in the repository (so that the instructions are snapshotted) as well as on the website for your repository. If the number of steps is small and easy to accomplish, then I wouldn't bother with much more. If there are many steps involved (like downloading and installing a large number of dependencies), then I would recommend writing a script that can automate the installation process. That being said, it's really about what the University wants in this case.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top