문제

What is a good way to bin numerical values into a certain range? For example, suppose I have a list of values and I want to bin them into N bins by their range. Right now, I do something like this:

from scipy import *
num_bins = 3 # number of bins to use
values = # some array of integers...
min_val = min(values) - 1
max_val = max(values) + 1
my_bins = linspace(min_val, max_val, num_bins)
# assign point to my bins
for v in values:
  best_bin = min_index(abs(my_bins - v))

where min_index returns the index of the minimum value. The idea is that you can find the bin the point falls into by seeing what bin it has the smallest difference with.

But I think this has weird edge cases. What I am looking for is a good representation of bins, ideally ones that are half closed half open (so that there is no way of assigning one point to two bins), i.e.

bin1 = [x1, x2)
bin2 = [x2, x3)
bin3 = [x3, x4)
etc...

what is a good way to do this in Python, using numpy/scipy? I am only concerned here with binning integer values.

thanks very much for your help.

도움이 되었습니까?

해결책

numpy.histogram() does exactly what you want.

The function signature is:

numpy.histogram(a, bins=10, range=None, normed=False, weights=None, new=None)

We're mostly interested in a and bins. a is the input data that needs to be binned. bins can be a number of bins (your num_bins), or it can be a sequence of scalars, which denote bin edges (half open).

import numpy
values = numpy.arange(10, dtype=int)
bins = numpy.arange(-1, 11)
freq, bins = numpy.histogram(values, bins)
# freq is now [0 1 1 1 1 1 1 1 1 1 1]
# bins is unchanged

To quote the documentation:

All but the last (righthand-most) bin is half-open. In other words, if bins is:

[1, 2, 3, 4]

then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.

Edit: You want to know the index in your bins of each element. For this, you can use numpy.digitize(). If your bins are going to be integral, you can use numpy.bincount() as well.

>>> values = numpy.random.randint(0, 20, 10)
>>> values
array([17, 14,  9,  7,  6,  9, 19,  4,  2, 19])
>>> bins = numpy.linspace(-1, 21, 23)
>>> bins
array([ -1.,   0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,
        10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,
        21.])
>>> pos = numpy.digitize(values, bins)
>>> pos
array([19, 16, 11,  9,  8, 11, 21,  6,  4, 21])

Since the interval is open on the upper limit, the indices are correct:

>>> (bins[pos-1] == values).all()
True
>>> import sys
>>> for n in range(len(values)):
...     sys.stdout.write("%g <= %g < %g\n"
...             %(bins[pos[n]-1], values[n], bins[pos[n]]))
17 <= 17 < 18
14 <= 14 < 15
9 <= 9 < 10
7 <= 7 < 8
6 <= 6 < 7
9 <= 9 < 10
19 <= 19 < 20
4 <= 4 < 5
2 <= 2 < 3
19 <= 19 < 20

다른 팁

Designer는 GTK # 클래스가 포함 된 모든 파일에 대해 표시되지 않으며 "창", "위젯"및 "대화 상자"템플릿을 사용하여 디자이너 지원으로 작성된 기능 만 나타냅니다.

이러한 템플릿은 프로젝트가 GTK #을 참조하는 경우 사용할 수 있지만 가장 쉬운 방법은 올바른 참조, 앱 초기화 보일러 플레이트 및 디자이너 창 클래스 (mainwindow.cs)를 가진 새로운 GTK # 응용 프로그램 프로젝트를 만드는 것입니다.

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top