Question

So I need to calculate the joint probability distribution for N variables. I have code for two variables, but I am having trouble generalizing it to higher dimensions. I imagine there is some sort of Pythonic vectorization that could help, but right now my code is very C-like (and yes, I know that is not the right way to write Python). My 2D code is below:

import numpy

feature1 = numpy.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9]);
feature2 = numpy.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,13.7,12.4,14.2,18.5,19.8,19.0]);

#===Concatenate All Features===#
numFrames = len(feature1);
allFeatures = numpy.zeros((2,numFrames));
allFeatures[0,:] = feature1;
allFeatures[1,:] = feature2;

#===Create the Array to hold all the Bins===#
numBins = int(0.25*numFrames);
allBins = numpy.zeros((allFeatures.shape[0],numBins+1));

#===Find the maximum and minimum of each feature===#
allRanges = numpy.zeros((allFeatures.shape[0],2));
for f in range(allFeatures.shape[0]):
    allRanges[f,0] = numpy.amin(allFeatures[f,:]);
    allRanges[f,1] = numpy.amax(allFeatures[f,:]);

#===Create the Array to hold all the individual feature probabilities===#
allIndividualProbs = numpy.zeros((allFeatures.shape[0],numBins));

#===Grab all the Individual Probs and the Bins===#
for f in range(allFeatures.shape[0]):
    freqhist, binedges = numpy.histogram(allFeatures[f,:],bins=numBins,range=[allRanges[f,0],allRanges[f,1]],density=False);
    allBins[f,:] = binedges;
    allIndividualProbs[f,:] = freqhist;

#===Create the joint probability array===#
jointProbs = numpy.zeros((numBins,numBins));

#===Compute the joint probability distribution===#
numElements = 0;
for b1 in range(numBins):
    for b2 in range(numBins):
        for f in range(numFrames):
            #===Count a frame once, when BOTH of its features fall in bins b1 and b2===#
            if ( (feature1[f] >= allBins[0,b1]) and (feature1[f] <= allBins[0,b1+1]) and (feature2[f] >= allBins[1,b2]) and (feature2[f] <= allBins[1,b2+1]) ):
                jointProbs[b1,b2] += 1;
                numElements += 1;

jointProbs /= numElements;

#===But what if I add the following===#
feature3 = numpy.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,23.7,26.4,24.2,28.5,29.8,29.0]);

How can I generalize the large loop? For N variables (features) this loop would be enormous. Is there a Pythonic way to do this easily?


Solution

Check out the function numpy.histogramdd. This function can compute histograms in an arbitrary number of dimensions. If you set the parameter density=True (normed=True in older NumPy versions, since removed), it returns the probability density: the bin count divided by the total number of samples and by the bin hypervolume. If you'd prefer something more like a probability mass function (where everything sums to 1), just normalize the raw counts yourself. All together, you'll have something like:

import numpy as np

numBins = 10  # number of bins in each dimension
data = np.random.randn(100000, 3)  # 100000 samples of 3 features, one sample per row
jointProbs, edges = np.histogramdd(data, bins=numBins)  # N-d array of raw bin counts
jointProbs /= jointProbs.sum()  # normalize so all bins sum to 1
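
To apply this to the question's data, stack the features as columns so that each row is one frame. A minimal sketch, reusing feature1, feature2, and feature3 from the question along with its numBins = int(0.25*numFrames) binning rule:

import numpy as np

# feature1, feature2 and feature3 as defined in the question
allFeatures = np.column_stack((feature1, feature2, feature3))  # shape (numFrames, 3)
numFrames = allFeatures.shape[0]
numBins = int(0.25 * numFrames)  # same binning rule as the 2D code

# raw counts in a numBins x numBins x numBins array, one axis per feature
jointProbs, edges = np.histogramdd(allFeatures, bins=numBins)
jointProbs /= jointProbs.sum()  # counts -> joint probabilities

The nested bin loops disappear entirely: histogramdd handles any number of features, and the marginal distribution of a single feature can be recovered by summing over the other axes, e.g. jointProbs.sum(axis=(1, 2)) for feature1.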
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow