Bar heights and widths in histogram plot of several data

https://stackoverflow.com/questions/23301039

09-07-2023

Question

I'm trying to plot a simple histogram with multiple data in parallel.
My data are a set of 2D ndarrays, all of them with the same dimension (in this example 256 x 256).

I have this method to plot the data set:

import matplotlib.pyplot as plt

def plot_data_histograms(data, bins, color, label, file_path):
    """
    Plot multiple data histograms in parallel
    :param data : a set of data to be plotted
    :param bins : the number of bins to be used
    :param color : the color of each data in the set
    :param label : the label of each color in the set
    :param file_path : the path where the output will be saved
    """
    plt.figure()
    # density=True is the current name of the normed=1 argument (renamed in Matplotlib 3.1)
    plt.hist(data, bins, density=True, color=color, label=label, alpha=0.75)
    plt.legend(loc='upper right')
    plt.savefig(file_path + '.png')
    plt.close()

And I'm passing my data as follows:

data = [sobel.flatten(), prewitt.flatten(), roberts.flatten(), scharr.flatten()]
labels = ['Sobel', 'Prewitt', 'Roberts Cross', 'Scharr']
colors = ['green', 'blue', 'yellow', 'red']

plot_data_histograms(data, 5, colors, labels, '../Visualizations/StatisticalMeasures/RMSEHistograms')

And I got this histogram:

[Image: histogram]

I know this may be a stupid question, but I don't understand why my yticks vary from 0 to 4.5. I know it is due to the normed parameter, but even after reading this:

If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., n/(len(x)*dbin). In a probability density, the integral of the histogram should be 1; you can verify that with a trapezoidal integration of the probability density function.

I still don't really understand how it works.

Also, since I set the number of bins to five and the histogram has exactly 5 xticks (excluding the borders), I don't understand why some bars sit in the middle of a tick, like the yellow one over the 0.6 tick. Since my number of bins matches the number of xticks, I thought that each set of four bars should fall inside one interval, as happens with the first four bars, which lie completely inside the [0.0, 0.2] interval.

Thank you in advance.

Solution

The reason this is confusing is that you're squeezing four histograms onto one plot. To do this, matplotlib narrows the bars and leaves a gap between the groups of bars. In a standard histogram, the total area of all bars is 1 if normed; otherwise the bar heights sum to N, the number of data points. Here's a simple example:

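import numpy as np
import matplotlib.pyplot as plt

a = np.random.rand(10)
bins = np.array([0, 0.5, 1.0])  # just two bins
plt.hist(a, bins, density=True)  # density=True was called normed=True before Matplotlib 3.1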

[Image: normed]

First, note that each bar covers the entire range of its bin: the first bar ranges from 0 to 0.5, and its height is determined by the number of points in that range.
Next, you can see that the total area of the two bars is 1 because the histogram is normalized (normed=True): the width of each bar is 0.5 and the heights are 1.2 and 0.8, so the area is 0.5 × 1.2 + 0.5 × 0.8 = 1.
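The normalization formula quoted in the question, n/(len(x)*dbin), can be checked by hand with NumPy. The sketch below uses ten made-up values (not the data from the question), chosen so that six fall in the first bin and four in the second, which gives the same heights of 1.2 and 0.8 discussed above.

```python
import numpy as np

# Ten made-up sample values in [0, 1]; hypothetical, not the question's data
a = np.array([0.05, 0.12, 0.21, 0.33, 0.41, 0.47, 0.58, 0.66, 0.79, 0.93])
bins = np.array([0.0, 0.5, 1.0])  # two bins, as in the example above

counts, edges = np.histogram(a, bins)   # raw counts per bin
widths = np.diff(edges)                 # bin widths (dbin in the formula)
density = counts / (len(a) * widths)    # n / (len(x) * dbin)

# NumPy's built-in normalization, for comparison with the manual formula
density_np, _ = np.histogram(a, bins, density=True)

area = np.sum(density * widths)         # total area under the bars

print(counts)   # counts per bin
print(density)  # normalized heights
print(area)     # total area
```

By the same arithmetic, if the five bins in the question each have width 0.2 (an assumption read off the [0.0, 0.2] interval mentioned there), a bar of height 4.5 would mean that 4.5 × 0.2 = 0.9 of that data set falls into that bin. Heights above 1 are therefore expected whenever the bins are narrower than 1.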

Let's plot the same thing again with another distribution so you can see the effect:

b = np.random.rand(10)
plt.hist([a, b], bins, density=True)  # density=True was called normed=True before Matplotlib 3.1

[Image: normed with two]

Recall that the blue bars represent exactly the same data as in the first plot, but they're less than half the width now because they must make room for the green bars. You can see that two bars plus some whitespace now cover the range of each bin. So when calculating the bin range and bar area, we must pretend that the width of each bar is actually the width of all the bars in the bin plus the width of the whitespace gap.
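One way to see this for yourself is to inspect the rectangles that hist returns. The following is a minimal sketch, assuming a recent Matplotlib where the argument is spelled density=True; the seeded generator and the Agg backend are only there so the snippet runs reproducibly without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility
a = rng.random(10)
b = rng.random(10)
bins = np.array([0.0, 0.5, 1.0])

fig, ax = plt.subplots()
# With several data sets, hist returns one container of rectangles per data set
heights, edges, containers = ax.hist([a, b], bins, density=True)

bin_width = edges[1] - edges[0]
drawn_widths = [rect.get_width() for c in containers for rect in c]

# Area of the first data set using the drawn (narrowed) bar widths
area_drawn = sum(rect.get_height() * rect.get_width() for rect in containers[0])
# Area of the first data set using the true bin widths
area_true = float(np.sum(heights[0] * np.diff(edges)))

print(bin_width, drawn_widths)
print(area_drawn, area_true)
plt.close(fig)
```

The drawn widths come out smaller than half the bin width, so the area computed from the drawn rectangles falls short of 1, while the area computed from the real bin width is 1.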

Finally, notice that the xticks do not line up with the bin edges anywhere. If you wish, you can make them line up manually with:

plt.xticks(bins)

If you didn't create the bins manually first, you can grab them from the return value of plt.hist:

counts, bins, bars = plt.hist(...)
plt.xticks(bins)
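Applied to a case like the one in the question (four data sets, five bins), this could look like the sketch below. The random arrays are hypothetical stand-ins for the four flattened images, and the axes-level calls ax.hist and ax.set_xticks are used in place of plt.hist and plt.xticks.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the four flattened arrays in the question
data = [rng.random(100) for _ in range(4)]

fig, ax = plt.subplots()
counts, edges, bars = ax.hist(data, bins=5, density=True)

# Put one tick on every bin edge so each group of bars sits between two ticks
ax.set_xticks(edges)
tick_positions = ax.get_xticks()

print(edges)
plt.close(fig)
```

Five bins give six edges, so the plot ends up with six ticks, and every group of four bars lies between two neighbouring ticks.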

Licensed under: CC-BY-SA with attribution