Question

I have a problem sorting list items into bins. I have two lists, X and Y, holding corresponding X and Y values (they could equally be one list of tuples). I need to split the X range into 10 equal bins and assign each X value, together with its Y value, to a bin, so that I know which Y values belong to which X bin. Then I take the median of all Y values in each bin, which gives me ten bin-median pairs. In principle this works with the following code, in which I also calculate the X-center of each bin.

    bins = np.linspace(max(X), min(X), 10)
    digitized = np.digitize(X, bins)
    bin_centers = []
    for j in range(len(bins) - 1):
        bin_centers.append((bins[j] + bins[j + 1]) / 2.)
    bin_means = [np.median(np.asarray(Y)[digitized == j])
                 for j in range(1, len(bins))]

The problem now is that sometimes a bin is empty since there is no X-value in this bin. In this case the line

    bin_means = [np.median(np.asarray(Y)[digitized == j])
                 for j in range(1, len(bins))]

raises the error

    /usr/lib64/python2.6/site-packages/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
    FloatingPointError: invalid value encountered in double_scalars

because of the empty bin. How can I fix that? I also tried right=True/False in numpy.digitize, with no luck. I think it would be best to delete the corresponding entries from bin_centers, digitized, and bins before running the list comprehension that calculates the medians. But I'm not sure how to find out which bins are empty, or what has to be deleted from those lists and how. Any ideas? Thanks!


Solution

If you have SciPy, you could call scipy.stats.binned_statistic:

import scipy.stats as stats
statistic, bin_edges, binnumber = stats.binned_statistic(
    x=X, values=Y, statistic='median', bins=bins)
statistic = statistic[np.isfinite(statistic)]
print(statistic)

yields

[ 15.  90.  50.  55.  40.  60.]
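Filtering `statistic` alone drops the link between each median and its bin. A minimal sketch of keeping the bin centers aligned with the surviving medians follows; the small X and Y arrays are made-up illustration data, not the asker's, and the names `occupied` and `bin_medians` are introduced here for the example.

```python
import numpy as np
from scipy import stats

# Made-up illustration data: the two middle bins end up empty.
X = np.array([0.0, 0.1, 0.2, 0.9, 1.0])
Y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bins = np.linspace(X.min(), X.max(), 5)  # 5 edges -> 4 bins

statistic, bin_edges, binnumber = stats.binned_statistic(
    x=X, values=Y, statistic='median', bins=bins)

# Midpoint of every bin, computed from the returned edges.
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Empty bins come back as nan; apply one mask to both arrays.
occupied = np.isfinite(statistic)
bin_centers = bin_centers[occupied]
bin_medians = statistic[occupied]

print(bin_centers)
print(bin_medians)
```

Note that binned_statistic treats the last bin as closed on the right, so the maximum X value is counted in the last bin here.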

Without SciPy, I think you would need a list comprehension. As you suggested, you could avoid the RuntimeWarning by filtering out those bins which are empty. You can do that with an if-condition inside a list comprehension:

masks = [(digitized == j) for j in range(1, len(bins))]
bin_medians = [np.median(Y[mask]) for mask in masks if mask.any()]
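To see which bins are empty before taking any medians, one option is to count the members of each bin with np.bincount. This is only a sketch on made-up data; `counts` and `empty_bins` are names introduced for the illustration.

```python
import numpy as np

# Made-up illustration data: nothing falls into bins 2 and 3.
X = np.array([0.0, 0.1, 0.2, 0.9, 1.0])
Y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bins = np.linspace(X.min(), X.max(), 5)
digitized = np.digitize(X, bins)  # indices run from 1 to len(bins)

# counts[j] is the number of X values that np.digitize assigned to index j.
counts = np.bincount(digitized, minlength=len(bins) + 1)
empty_bins = [j for j in range(1, len(bins)) if counts[j] == 0]
print(empty_bins)

# Same filtering as the list comprehension above, restricted to occupied bins.
# As in that snippet, max(X) gets index len(bins) and is therefore not included.
masks = [(digitized == j) for j in range(1, len(bins))]
bin_medians = [np.median(Y[mask]) for mask in masks if mask.any()]
print(bin_medians)
```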

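Also note that, under NumPy's default error handling, the messages you are seeing are warnings, not exceptions (a FloatingPointError is only raised if the error handling has been set to raise, e.g. with np.seterr). You could (alternatively) suppress the warnings with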

import warnings
warnings.filterwarnings("ignore", 'Mean of empty slice.')
warnings.filterwarnings("ignore", 'invalid value encountered in double_scalar')
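If silencing these warnings for the whole program feels too broad, a possible variant is to suppress them only around the median call with the standard-library warnings.catch_warnings context manager. The sketch below ignores every RuntimeWarning inside the block instead of matching message text, which is an assumption made here because the exact wording can differ between NumPy versions.

```python
import warnings
import numpy as np

empty = np.array([], dtype=float)  # stands in for an empty bin

with warnings.catch_warnings():
    # Only active inside this block; the previous filters are restored on exit.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    median_of_empty = np.median(empty)

print(median_of_empty)  # nan under NumPy's default error handling
```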

There is a way to compute the bin_centers more quickly:

bin_centers = []
for j in range(len(bins) - 1):
    bin_centers.append((bins[j] + bins[j + 1]) / 2.)

could be simplified, since the bins are equally spaced, to

bin_centers = bins[:-1] + (bins[1]-bins[0])/2
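As a quick check, the loop and the vectorized form can be compared directly. The sketch also shows a midpoint formula, `(bins[:-1] + bins[1:]) / 2`, which does not rely on equal spacing; the variable names are chosen for this illustration only.

```python
import numpy as np

bins = np.linspace(0.0, 1.0, 5)

# Original loop from the question.
loop_centers = []
for j in range(len(bins) - 1):
    loop_centers.append((bins[j] + bins[j + 1]) / 2.)

# Vectorized form, valid when all bins have the same width.
uniform_centers = bins[:-1] + (bins[1] - bins[0]) / 2

# Midpoints of neighbouring edges, valid for unequal bin widths too.
general_centers = (bins[:-1] + bins[1:]) / 2

print(np.allclose(loop_centers, uniform_centers))
print(np.allclose(loop_centers, general_centers))
```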

So, for example,

import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", 'Mean of empty slice.')
warnings.filterwarnings("ignore", 'invalid value encountered in double_scalar')

np.random.seed(123)

X = np.random.random(10)
bins = np.linspace(min(X), max(X), 10)
digitized = np.digitize(X, bins)-1
bin_centers = bins + (bins[1]-bins[0])/2

Y = range(0, 100, 10)
Y = np.asarray(Y, dtype='float')
bin_medians = [np.median(Y[digitized == j]) for j in range(len(bins))]
print(bin_medians)

plt.scatter(bin_centers, bin_medians)
plt.show()

yields

[15.0, 90.0, 50.0, 55.0, nan, 40.0, nan, nan, nan, 60.0]


If your purpose is only to make the scatter plot, then it is not necessary to remove the nans since matplotlib will ignore them anyway.

If you really want to remove the nans, then you could use

no_nans = np.isfinite(bin_medians)
# bin_medians is a plain list, so convert it before boolean indexing
bin_medians = np.asarray(bin_medians)[no_nans]
bin_centers = bin_centers[no_nans]

In the above, I opted for using warnings.filterwarnings to just suppress the warnings. If you don't wish to suppress warnings, and would rather filter the nans from bin_medians and from the corresponding locations from bin_centers, then:

bin_centers = bins + (bins[1]-bins[0])/2
masks = [(digitized == j) for j in range(len(bins))]
bin_centers, bin_medians = zip(*[(center, np.median(Y[mask]))
                                 for center, mask in zip(bin_centers, masks)
                                 if mask.any()])
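Putting the pieces together, here is a self-contained sketch of the whole procedure. It is not the code above verbatim: `binned_medians` is a hypothetical helper name, it uses n_bins + 1 edges so that there really are n_bins bins, and it clips the index so that max(X) lands in the last bin. All three choices are assumptions made for this illustration.

```python
import numpy as np

def binned_medians(x, y, n_bins=10):
    """Return (centers, medians) for the non-empty bins only (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)  # n_bins + 1 edges
    # Zero-based bin index; clipping puts x.max() into the last bin.
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    occupied = np.unique(idx)  # indices of bins holding at least one point
    medians = np.array([np.median(y[idx == j]) for j in occupied])
    return centers[occupied], medians

# Made-up illustration data with two empty bins in the middle.
x = [0.0, 0.1, 0.2, 0.9, 1.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
centers, medians = binned_medians(x, y, n_bins=4)
print(centers)
print(medians)
```

Because the medians are only computed for occupied bins, no empty slice is ever passed to np.median and no warning needs to be suppressed.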

OTHER TIPS

I don't quite understand the question, but here's something to maybe get you started:

In [3]: X = [1,2,3,4,5,6,7,8,9,10]

In [4]: Y = [chr(96+x) for x in X]

In [8]: Z = zip(X, Y)    # Create a pairing - this can be done after a sort if they're not in whatever 'order' you want for your correspondence

In [9]: Z
Out[9]:
[(1, 'a'),
 (2, 'b'),
 (3, 'c'),
 (4, 'd'),
 (5, 'e'),
 (6, 'f'),
 (7, 'g'),
 (8, 'h'),
 (9, 'i'),
 (10, 'j')]

At this point you can do something like sorted(Z, key=lambda el: -ord(el[1])) or whatever to sort based on your criteria. Obviously it'd be more meaningful than the example.
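The session above looks like Python 2, where zip returns a list. A sketch of the same pairing and sort under Python 3 (an assumption about the reader's interpreter) would wrap zip in list; `by_letter_desc` is a name made up for this example.

```python
X = list(range(1, 11))
Y = [chr(96 + x) for x in X]  # 'a' .. 'j'

# In Python 3, zip returns an iterator, so materialize it to inspect or reuse it.
Z = list(zip(X, Y))

# Sort the pairs by their letter in descending order, as in the example key above.
by_letter_desc = sorted(Z, key=lambda el: -ord(el[1]))
print(by_letter_desc[:3])
```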

Finally, to chunk into equal-length parts, which I think you might also want, take a look at the wide variety of possibilities given as answers here.

If that's not what you were looking for, apologies.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow