Question

I am brand new to the field of data science, want to break into it, and there are so many tools out there. These VMs have a lot of software on them, but I haven't been able to find any side-by-side comparison.

Here's a start from my research, but if someone could tell me that one is objectively more feature-rich, has a larger support community, and is useful for getting started, that would help greatly:

datasciencetoolKIT.org -> VM is on Vagrant Cloud (4 GB) and seems to be more "hip", with R, IPython notebook, and other useful command-line tools (html->txt, json->xml, etc). There is a book being released in August with more detail.

datasciencetoolBOX.org -> VM is a Vagrant box (24 GB) downloadable from their website. There seem to be more features here, and more literature.


Solution

Do you need a VM?

Keep in mind that a virtual machine is a software emulation of a hardware configuration (your own machine's or another's) that can run an operating system. In the most basic terms, it acts as a layer between the guest OS and your host OS, which in turn communicates with the underlying hardware on the guest's behalf. What this means for you is:

Cons

Hardware Support

A drawback of virtual machine technology is that it supports only the hardware that both the virtual machine hypervisor and the guest operating system support. Even if the guest operating system supports the physical hardware, it sees only the virtual hardware presented by the virtual machine. The second aspect of virtual machine hardware support is the hardware presented to the guest operating system. No matter the hardware in the host, the hardware presented to the guest environment is usually the same (with the exception of the CPU, which shows through). For example, VMware GSX Server presents an AMD PCnet32 Fast Ethernet card or an optimized VMware-proprietary network card, depending on which you choose. The network card in the host machine does not matter. VMware GSX Server performs the translation between the guest environment's network card and the host environment's network card. This is great for standardization, but it also means that host hardware that VMware does not understand will not be present in the guest environment.

Performance Penalty

Virtual machine technology imposes a performance penalty from running an additional layer above the physical hardware but beneath the guest operating system. The performance penalty varies based on the virtualization software used and the guest software being run. Depending on the workload, the penalty can be significant.

Pros

Isolation

One of the key reasons to employ virtualization is to isolate applications from each other. Running everything on one machine would be great if it all worked, but many times it results in undesirable interactions or even outright conflicts. The cause often is software problems or business requirements, such as the need for isolated security. Virtual machines allow you to isolate each application (or group of applications) in its own sandbox environment. The virtual machines can run on the same physical machine (simplifying IT hardware management), yet appear as independent machines to the software you are running. For all intents and purposes (except performance), the virtual machines are independent machines. If one virtual machine goes down due to an application or operating system error, the others continue running, providing the services your business needs to function smoothly.

Standardization

Another key benefit virtual machines provide is standardization. The hardware that is presented to the guest operating system is uniform for the most part, usually with the CPU being the only component that is "pass-through" in the sense that the guest sees what is on the host. A standardized hardware platform reduces support costs and increases the share of IT resources that you can devote to accomplishing goals that give your business a competitive advantage. The host machines can be different (as indeed they often are when hardware is acquired at different times), but the virtual machines will appear to be the same across all of them.

Ease of Testing

Virtual machines let you test scenarios easily. Most virtual machine software today provides snapshot and rollback capabilities. This means you can stop a virtual machine, create a snapshot, perform more operations in the virtual machine, and then roll back again and again until you have finished your testing. This is very handy for software development, but it is also useful for system administration. Admins can snapshot a system and install some software or make some configuration changes that they suspect may destabilize the system. If the software installs or changes work, then the admin can commit the updates. If the updates damage or destroy the system, the admin can roll them back. Virtual machines also facilitate scenario testing by enabling virtual networks. In VMware Workstation, for example, you can set up multiple virtual machines on a virtual network with configurable parameters, such as packet loss from congestion and latency. You can thus test timing-sensitive or load-sensitive applications to see how they perform under the stress of a simulated heavy workload.
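As a rough illustration of the snapshot-and-rollback workflow described above, here is a minimal Python sketch. It assumes the VM is managed by Vagrant (as both VMs in the question are) and that the installed Vagrant version provides the built-in `vagrant snapshot` subcommand. The helper names `snapshot_command` and `run_snapshot` and the snapshot name are hypothetical, chosen only for this example.

```python
import subprocess


def snapshot_command(action, name):
    """Build a `vagrant snapshot` command line for the given action.

    `action` is "save" (take a snapshot) or "restore" (roll back to it).
    This helper is illustrative and not part of Vagrant itself.
    """
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["vagrant", "snapshot", action, name]


def run_snapshot(action, name):
    """Run the command; assumes `vagrant` is on PATH and the current
    directory contains the Vagrantfile of the VM."""
    return subprocess.run(snapshot_command(action, name), check=True)
```

A typical session would call `run_snapshot("save", "before-install")` before a risky change and `run_snapshot("restore", "before-install")` if the change breaks the environment.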

Mobility

Virtual machines are easy to move between physical machines. Most of the virtual machine software on the market today stores a whole disk in the guest environment as a single file in the host environment. Snapshot and rollback capabilities are implemented by storing the change in state in a separate file in the host environment. Having a single file represent an entire guest environment disk promotes the mobility of virtual machines. Transferring the virtual machine to another physical machine is as easy as moving the virtual disk file and some configuration files to the other physical machine. Deploying another copy of a virtual machine is the same as transferring a virtual machine, except that instead of moving the files, you copy them.
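To make the "copy the files" idea concrete, here is a small Python sketch. It assumes the VM's disk image and configuration files all sit in one flat directory; the function name `clone_vm` is hypothetical, and the file names you would see in practice depend on the hypervisor. Real hypervisors may also require the copy to be registered or given a new identifier, which this sketch does not do.

```python
import shutil
from pathlib import Path


def clone_vm(source_dir, target_dir):
    """Copy every file of a VM (disk image plus config) into target_dir.

    Returns the list of copied file paths, in sorted name order.
    Assumption: the VM lives in a single flat directory.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(source.iterdir()):
        if item.is_file():
            # copy2 also preserves file metadata such as timestamps
            copied.append(Path(shutil.copy2(item, target / item.name)))
    return copied
```

Transferring instead of deploying a copy would be the same loop with `shutil.move` in place of `shutil.copy2`.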

Which VM should I use if I am starting out?

The Data Science Box or the Data Science Toolbox are your best bets if you are just getting into data science. They have the basic software that you will need, with the primary difference being the virtual environment in which each of them can run. The DSB can run on AWS, while the DST can run on VirtualBox (which is the most common tool used for VMs).


OTHER TIPS

In most cases a practicing data scientist creates their own working environment on a personal computer by installing preferred software packages. Normally this is a sufficient and efficient use of computing resources, because to run a virtual machine (VM) on your main machine you have to allocate a significant portion of RAM to it. Software will run noticeably slower on both the main and the virtual machine unless the host has a lot of RAM.

Because of this impact on speed, it is not common to use VMs as the main working environment, but they are a good solution in several cases where an additional working environment is needed.
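The RAM trade-off can be checked with simple arithmetic before committing to a VM. The sketch below is only a back-of-the-envelope helper: the function name `host_ram_after_vm` is hypothetical, and the default of 4 GB left for the host is an assumed threshold, not a figure from any of the tools discussed here.

```python
def host_ram_after_vm(host_ram_gb, vm_ram_gb, min_host_gb=4.0):
    """Return (remaining_gb, ok) after reserving vm_ram_gb for a VM.

    `ok` is True when the host keeps at least min_host_gb for itself.
    The 4 GB default is an assumption; adjust it to your own workload.
    """
    if host_ram_gb <= 0 or vm_ram_gb < 0:
        raise ValueError("RAM sizes must be positive")
    remaining = host_ram_gb - vm_ram_gb
    return remaining, remaining >= min_host_gb
```

For example, on an 8 GB laptop `host_ram_after_vm(8, 6)` leaves 2 GB for the host and reports that this is below the assumed threshold, which is the situation where both machines tend to feel slow.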

VMs should be considered when:

  1. There is a need to easily replicate a number of identical computing environments when teaching a course or giving a presentation at a conference.
  2. There is a need to save and recreate an exact environment for an experiment or a calculation.
  3. There is a need to run a different OS or to test a solution on a tool that runs on a different OS.
  4. One wants to try out a bundle of software tools before installing them on the main machine. E.g. there is an opportunity to install an instance of Hadoop (CDH) on a VM during an Intro to Hadoop course on Udacity.
  5. VMs are sometimes used for fast deployment in the cloud, e.g. on AWS EC2, Rackspace, etc.

The VMs mentioned in the original question are built as easy-to-install bundles of data science software. There are more than these two. This blog post by Jeroen Janssens gives a comparison of at least four:

  1. Data Science Toolbox
  2. Mining the Social Web
  3. Data Science Toolkit
  4. Data Science Box
Licensed under: CC-BY-SA with attribution