Question

I have a bunch of CSV datasets, about 10 GB in size each. I'd like to generate histograms from their columns. But it seems like the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.

Does numpy support online binning? I'm hoping for something that iterates over my csv line by line and bins values as it reads them. This way at most one line is in memory at any one time.

Wouldn't be hard to roll my own, but wondering if someone already invented this wheel.

Solution

As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:

import numpy as np

# fixed bin edges, set up once and reused for every chunk
datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins + 1)  # numbins bins need numbins + 1 edges
myhist = np.zeros(numbins, dtype='int32')

# stand-in for iterating over the file in chunks: histogram each
# chunk against the shared edges and accumulate the counts
for i in range(100):
    d = np.random.randn(1000, 1)
    htemp, _ = np.histogram(d, mybins)
    myhist += htemp

I'm guessing performance will be an issue with files this large, and the overhead of calling np.histogram once per line would be too slow. @doug's suggestion of a generator seems like a good way to address that problem: read the file a chunk of lines at a time and histogram each chunk against the shared edges.
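For example, here's a minimal sketch of that chunked approach (the filename, column index, and chunk size are stand-ins of my own, not part of the original answer):

import itertools
import numpy as np

def column_chunks(filename, col, chunklines=100000):
    # yield one CSV column as numpy arrays, chunklines rows at a time,
    # so at most chunklines lines are ever in memory
    with open(filename) as f:
        while True:
            lines = list(itertools.islice(f, chunklines))
            if not lines:
                break
            yield np.array([float(line.split(",")[col]) for line in lines])

edges = np.linspace(-5, 5, 21)  # the same fixed edges as above
hist = np.zeros(len(edges) - 1, dtype=np.int64)
for chunk in column_chunks("mydata.csv", col=0):
    counts, _ = np.histogram(chunk, edges)
    hist += counts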

OTHER TIPS

Here's a way to bin your values directly:

import numpy as np

# a stand-in column of values to bin
column_of_values = np.random.randint(10, 99, 10)

# set the bin boundaries:
bins = np.array([0.0, 20.0, 50.0, 75.0])

binned_values = np.digitize(column_of_values, bins)

'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs. (Values below the first boundary get index 0, and values at or above the last boundary get index len(bins).)

'bincount' will give you (obviously) the bin counts:

np.bincount(binned_values)
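
To keep this streaming-friendly, the same pair of calls can be applied chunk by chunk; here's a sketch where the random chunks stand in for pieces of the file (minlength pads bincount so every chunk yields a same-length array):

import numpy as np

bins = np.array([0.0, 20.0, 50.0, 75.0])
totals = np.zeros(len(bins) + 1, dtype=np.int64)  # digitize indices run 0..len(bins)

for chunk in (np.random.randint(10, 99, 10) for _ in range(5)):  # stand-in for file chunks
    idx = np.digitize(chunk, bins)
    totals += np.bincount(idx, minlength=len(bins) + 1)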

Given the size of your data set, using NumPy's 'loadtxt' to build a generator might be useful:

data_array = np.loadtxt("data_file.txt", delimiter=",")  # note: loadtxt reads the whole file at once

def fnx():
    # yield the loaded array one column at a time
    for i in range(data_array.shape[1]):
        yield data_array[:, i]

Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)

I'm posting a second answer to the same question since this approach is very different, and addresses different issues.

What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up into quartiles or deciles.

For small datasets, the answer is easy: load the data into an array, sort it, then read off the value at any given percentile by jumping to the index that percentage of the way through the array.

For large datasets, where holding the whole array in memory is not practical (not to mention the time to sort it), consider using a Fenwick Tree, aka a "Binary Indexed Tree".

I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.

I've used this to find the median of a 100-billion-sample dataset in reasonable time and within very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful here.)
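
Since no reference code survives here, the following is a minimal sketch of the idea, assuming non-negative integer data; the class name, value range, and percentile helper are my own illustration, not the original poster's code:

class FenwickTree:
    # frequency table over the integers 0..maxval with O(log n) updates
    # and O(log n) prefix counts (a standard Binary Indexed Tree)
    def __init__(self, maxval):
        self.size = maxval + 1
        self.tree = [0] * (self.size + 1)  # 1-based internal indexing
        self.total = 0

    def add(self, value, count=1):
        # record `count` occurrences of the integer `value`
        self.total += count
        i = value + 1  # shift to 1-based index
        while i <= self.size:
            self.tree[i] += count
            i += i & (-i)

    def prefix_count(self, value):
        # number of recorded samples <= value
        i = value + 1
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def percentile(self, p):
        # smallest value v such that at least p% of the samples are <= v,
        # found by binary search over the value range
        target = self.total * p / 100.0
        lo, hi = 0, self.size - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.prefix_count(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        return lo

tree = FenwickTree(maxval=9999)
for v in [5, 17, 42, 42, 9000]:  # stand-in for a huge stream of integers
    tree.add(v)
print(tree.percentile(50))  # median of what's been seen so far: 42

Note that memory is proportional to the value range rather than the number of samples, which is what makes the very-large-dataset case workable.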

More on Fenwick Trees: https://en.wikipedia.org/wiki/Fenwick_tree

Binning with Generators (large dataset; fixed-width bins; float data)

If you know the width of your desired bins ahead of time -- even if there are hundreds or thousands of buckets -- then I think rolling your own solution would be fast (both to write and to run). Here's some Python that assumes you have an iterator that gives you the next value from the file:

from math import floor

binwidth = 20
counts = dict()
filename = "mydata.csv"
for val in next_value_from_file(filename):
    binname = int(floor(val / binwidth) * binwidth)  # label each bin by its lower edge
    if binname not in counts:
        counts[binname] = 0
    counts[binname] += 1
print(counts)

The values can be floats, but this assumes an integer binwidth; if you want a float binwidth you'll need to tweak it slightly, for example as below.
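
A minimal tweak for a float binwidth, keeping the bin's (now float) lower edge as the label:

from math import floor

binwidth = 2.5
for val in (0.3, 7.6, 9.9):  # stand-in values
    binname = floor(val / binwidth) * binwidth  # lower edge of the bin
    print(val, "->", binname)  # 0.3 -> 0.0, 7.6 -> 7.5, 9.9 -> 7.5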

As for next_value_from_file(), as mentioned earlier, you'll probably want to write a custom generator or an object with an __iter__() method to do this efficiently. The pseudocode for such a generator would be this:

def next_value_from_file(filename):
    with open(filename) as f:  # closes the file once the generator is exhausted
        for line in f:
            # parse out from the line the value or values you need
            val = parse_the_value_from_the_line(line)
            yield val

If a given line has multiple values, then make parse_the_value_from_the_line() either return a list or itself be a generator, and use this pseudocode:

def next_value_from_file(filename):
    with open(filename) as f:
        for line in f:
            # one line may carry several values; yield them one by one
            for val in parse_the_values_from_the_line(line):
                yield val
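
As a concrete version of that pseudocode, here's a hypothetical sketch using the standard library's csv module to pull a single column out of each row as a float (the column index is an assumption):

import csv

def next_value_from_file(filename, col=0):
    # concrete stand-in for parse_the_value_from_the_line:
    # take one column from each CSV row and yield it as a float
    with open(filename, newline="") as f:
        for row in csv.reader(f):
            yield float(row[col])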
Licensed under: CC-BY-SA with attribution