Question

I read in this post, Is the R language suitable for Big Data, that big data constitutes 5 TB, and while it does a good job of discussing the feasibility of working with this type of data in R, it provides very little information about Python. I was wondering whether Python can work with this much data as well.


Solution

To clarify, I feel like the original question referenced by the OP probably isn't the best fit for an SO-type format, but I will certainly represent Python in this particular case.

Let me just start by saying that regardless of your data size, Python shouldn't be your limiting factor. In fact, there are just a couple of main issues that you're going to run into when dealing with large datasets:

  • Reading data into memory - This is by far the most common issue faced in the world of big data. Basically, you can't read in more data than you have memory (RAM) for. The best way around this is to perform atomic operations on chunks of your data instead of trying to read everything in at once (see the sketch after this list).
  • Storing data - This is actually just another form of the earlier issue; by the time you get up to about 1 TB, you start having to look elsewhere for storage. AWS S3 is the most common resource, and Python has the fantastic boto library to facilitate working with large pieces of data.
  • Network latency - Moving data around between different services is going to be your bottleneck. There's not a huge amount you can do to fix this, other than trying to pick co-located resources and using a wired connection instead of Wi-Fi.
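For the first point, here is a minimal sketch of chunked processing with pandas; the file name and column are hypothetical placeholders, not anything from the original question:

```python
import pandas as pd

# Process a large CSV in fixed-size chunks instead of loading it all at once.
# "events.csv" and the "amount" column are placeholder names for illustration.
total = 0.0
n_rows = 0
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    n_rows += len(chunk)

print("mean amount:", total / n_rows)
```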

OTHER TIPS

There are a couple of things you need to understand when dealing with Big Data:

What is Big data?

You might be aware of the famous V's of Big Data - Volume, Velocity, Variety... So Python may not be suitable for every part of the pipeline, and the same goes for all of the data science tools available. You need to know which tool is good for which purpose.

If dealing with a large Volume of data:

  • Pig/Hive/Shark - Data cleaning and ETL work
  • Hadoop/Spark - Distributed parallel computing
  • Mahout/ML-Lib - Machine Learning

Now, you can use R/Python in the intermediate stages, but you'll realize that they become a bottleneck in your entire process.

If dealing with Velocity of data:

  • Kafka/Storm - High throughput system

People are trying R/Python here, but again, it depends on the kind of parallelism you want and on your model complexity.
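As a rough illustration only, consuming a high-velocity stream from Python with the kafka-python package might look like the sketch below; the topic name and broker address are assumptions:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; in practice these come from your deployment.
consumer = KafkaConsumer(
    "click-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each message arrives as raw bytes; decode and process incrementally
    # rather than accumulating the whole stream in memory.
    record = message.value.decode("utf-8")
    print(record)
```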

What sort of analysis do you wish to do?

If your model demands that the entire data be brought into memory first, then your model should not be complex, because if the intermediate data is large the code will break. And if you think of writing it to disk, you'll face additional delay, because disk read/write is slow compared to RAM.
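One way around that constraint is out-of-core (incremental) learning, where the model only ever sees one chunk at a time. A minimal sketch using scikit-learn's partial_fit, with placeholder file and column names:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Incrementally fit a linear model on chunks of a file too large for RAM.
# "train.csv", the feature columns, and the "label" column are placeholders.
clf = SGDClassifier()
classes = [0, 1]  # all classes must be declared up front for partial_fit

for chunk in pd.read_csv("train.csv", chunksize=100_000):
    X = chunk[["feature_1", "feature_2"]].to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```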

Conclusion

You can definitely use Python in the Big Data space (definitely - since people are trying it with R, why not Python?), but know your data and your business requirements first. There may be better tools available for the same task, and always remember:

Your tools shouldn’t determine how you answer questions. Your questions should determine what tools you use.

Python has some very good tools for working with big data:

numpy

Numpy's memory-mapped arrays let you access a file saved on disk as though it were an array. Only the parts of the array you are actively working with need to be loaded into memory. A memory-mapped array can be used pretty much the same as an ordinary array.
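A small sketch of a memory-mapped array; the file name and shape are made up for illustration:

```python
import numpy as np

# Create an array backed by a file on disk rather than RAM.
arr = np.memmap("big_array.dat", dtype="float64", mode="w+", shape=(1_000_000, 100))
arr[:1000, :] = np.random.rand(1000, 100)   # only this slice is touched in memory
arr.flush()

# Later, reopen read-only and work with just the slices you need.
arr2 = np.memmap("big_array.dat", dtype="float64", mode="r", shape=(1_000_000, 100))
print(arr2[:1000, 0].mean())
```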

h5py and pytables

These two libraries provide access to HDF5 files. These files allow access to just part of the data. Further, thanks to the underlying libraries used to access the data, many mathematical operations and other manipulations can be done without loading the data into a Python data structure. Massive, highly structured files are possible, much bigger than 5 TB. HDF5 also allows seamless, lossless compression.
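For example, with h5py you can write a compressed dataset and later read back only a slice of it; the file and dataset names below are assumptions:

```python
import h5py
import numpy as np

# "data.h5" and the dataset name "measurements" are placeholders.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("measurements", data=np.random.rand(10_000, 50),
                     compression="gzip")

with h5py.File("data.h5", "r") as f:
    block = f["measurements"][:100, :]   # only this slice is loaded into RAM
    print(block.mean())
```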

databases

There are various types of databases that allow you to store big data sets and load just the parts you need. Many databases allow you to do manipulations without loading the data into a python data structure at all.
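As a simple sketch with the standard-library sqlite3 module (the database file, table, and columns are hypothetical), you can push filtering and aggregation into the database and pull back only the small result:

```python
import sqlite3

# Let the database do the heavy lifting; only the result set reaches Python.
# "sales.db", the "orders" table, and its columns are placeholder names.
conn = sqlite3.connect("sales.db")
cur = conn.execute(
    "SELECT region, SUM(amount) FROM orders WHERE year = ? GROUP BY region",
    (2023,),
)
for region, total in cur:
    print(region, total)
conn.close()
```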

pandas

This allows higher-level access to various types of data, including HDF5 data, CSV files, databases, and even websites. For big data, it provides wrappers around HDF5 file access that make it easier to do analysis on big data sets.
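For instance, pandas' HDFStore can pull just the matching rows out of an on-disk HDF5 file; the file, key, and column names here are assumptions:

```python
import pandas as pd

# Query only matching rows from an on-disk HDF5 store; names are illustrative.
# The store must have been written in "table" format (with "amount" as a data
# column) for where= queries to work.
with pd.HDFStore("store.h5") as store:
    subset = store.select("events", where="amount > 100")
print(len(subset))
```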

mpi4py

This is a tool for running your python code in a distributed way across multiple processors or even multiple computers. This allows you to work on parts of your data simultaneously.
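A minimal scatter/reduce sketch with mpi4py, run under mpiexec; the array size is illustrative:

```python
from mpi4py import MPI
import numpy as np

# Run with e.g.: mpiexec -n 4 python script.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The root process splits the data; every process works on its own piece.
data = np.arange(1_000_000, dtype="float64") if rank == 0 else None
chunks = np.array_split(data, size) if rank == 0 else None
local = comm.scatter(chunks, root=0)

local_sum = local.sum()
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("total:", total)
```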

dask

It provides a version of the normal numpy array that supports many of the usual numpy operations in a multi-core manner and can work on data too large to fit into memory.
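A short sketch with dask.array, computing over an array processed chunk by chunk; the shape and chunk size are illustrative:

```python
import dask.array as da

# A large array built lazily in 10_000-row chunks; nothing is materialized
# until .compute() is called, and the work is spread across cores.
x = da.random.random((1_000_000, 100), chunks=(10_000, 100))
result = (x - x.mean(axis=0)).std(axis=0).compute()
print(result[:5])
```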

blaze

A tool specifically designed for big data. It is basically a wrapper around the above libraries, providing consistent interfaces to a variety of different methods of storing large amounts of data (such as HDF5 or databases) and tools to make it easy to manipulate, do mathematical operations on, and analyze data that is too big to fit into memory.

Absolutely. When you're working with data at that scale it's common to use a big data framework, in which case Python (or whatever language you're using) is merely an interface. See, for example, Spark's Python Programming Guide. What kind of data do you have, and what do you want to do with it?
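For example, in a small PySpark sketch like the one below, Python is only the driver and Spark distributes the actual work; the file path and column names are assumptions:

```python
from pyspark.sql import SparkSession

# "hdfs:///data/events.parquet" and the column names are placeholders.
spark = SparkSession.builder.appName("big-data-example").getOrCreate()

df = spark.read.parquet("hdfs:///data/events.parquet")
(df.groupBy("country")
   .count()
   .orderBy("count", ascending=False)
   .show(10))

spark.stop()
```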

To handle such an amount of data, the programming language is not the main concern; the programming framework is. Frameworks such as MapReduce or Spark have bindings to many languages, including Python, and they certainly have many ready-to-use packages for data analysis tasks. But in the end it all comes down to your requirements, i.e., what is your task? People have different definitions of data analysis tasks, and some of them can be easily solved with relational databases. In that case, SQL is much better than all the other alternatives.

I believe the language itself has little to do with performance capabilities, when it comes to large data. What matters is:

  • How large is the data actually
  • What processing are you going to perform on it
  • What hardware are you going to use
  • Which are the specific libraries that you plan to use

Anyway, Python is well adopted in data science communities.

I've been using Anaconda Python 3.4 and pandas to search a 10M-row database to match 20K login credentials. It takes about a minute. The pandas internals make great use of memory. That said, truly big data requires a processing architecture matched to the problem. Pandas is just the glue (logic) in this equation, and other tools can do this as well. R, Scala, Haskell, SAS, etc. can replicate some of the logic - perhaps just enough to answer questions faster. But Python makes a good (best?) general-purpose tool. You can run R code in Python, as well as most other languages. Although Python is interpreted, there are high-performance techniques and tools such as PyPy that can make it run almost as fast as benchmark tools with only slightly more effort. And Python has many libraries that do just about everything - see the list above.
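A rough sketch of that kind of lookup with a pandas merge; the file and column names are hypothetical:

```python
import pandas as pd

# Match a small table of credentials against a large table of rows.
logins = pd.read_csv("logins_10m.csv")            # ~10M rows (placeholder file)
credentials = pd.read_csv("credentials_20k.csv")  # ~20K rows (placeholder file)

matched = logins.merge(credentials, on="username", how="inner")
print(len(matched), "matching rows")
```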

If you are asking whether you should learn and use Python, my answer is yes. Articles indicate that Python is used more than R among people who use both. But few data science problems are solved by a single tool. It may become your go-to tool, but it's only that - a tool. And just as no sane person builds a house with just a hammer, no sane data scientist uses just one tool.

It's funny how people mix big data with data science and business intelligence.

First, big data means "a lot of data": so much information that it doesn't fit in a conventional database. However, sometimes big data is not even proper "value" information but documents, images, and so on.

So, to process big data, WE NEED SPEED. Python is out of its league here, and so is R. However, if the task is as easy as taking a CSV and inserting it into a database, then it's ETL, and we don't need programming to do that.

And when the information has been reduced, then we can apply Python, R, or whatever you want. Even Excel. However, at that stage, the Big Data is not big anymore but conventional data.

IMHO, Java is more suitable for Big Data (for the whole chain), but people take Python as the default for some impractical reason.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange