Question

I was wondering if someone could please explain what the following functions in scipy.stats do:

rv_continuous.expect
rv_continuous.pdf

I have read the documentation but I am still confused.

Here is my task. It is quite simple in theory, but I am still confused about what these functions do.

So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value, called "inf", and a larger value, "sup".

So, what I thought is:

scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)

So that I can get the probability that any area is between inf and sup.

Can anyone explain in a simple way what these functions do, and give me a hint on how to compute the integral of f(a) between inf and sup, please?

Thanks

Blaise


Solution 2

The cumulative distribution function might give you what you want. The probability P of being between two values is P(inf < area < sup) = cdf(sup) - cdf(inf).

There's a tutorial about probabilities here and here. These concepts are all related. The pdf is the "density" of the probabilities: it must be non-negative and integrate to 1. I think of it as indicating how likely each value is. The expectation is a generalisation of the idea of an average:

E[X] = sum(x * P(x))            (discrete case)
E[X] = integral of x * f(x) dx  (continuous case)
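
As a concrete sketch of the cdf approach, using a normal distribution with made-up parameters and bounds (none of these values come from the question; they are illustrative assumptions):

from scipy import stats

# Illustrative distribution: normal with mean 10 and standard deviation 2.
dist = stats.norm(loc=10, scale=2)

inf, sup = 8.0, 12.0  # made-up bounds

# P(inf < X < sup) = cdf(sup) - cdf(inf)
prob = dist.cdf(sup) - dist.cdf(inf)
print(prob)  # about 0.68, i.e. within one standard deviation of the mean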

Other tips

rv_continuous is the base class for all of the continuous probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself; you call them through one of its concrete subclasses, such as scipy.stats.norm or scipy.stats.lognorm.
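
For example, a minimal sketch (the choice of the standard normal here is just for illustration):

from scipy import stats

# pdf and cdf are defined on rv_continuous, but you call them through
# a concrete distribution such as norm (the standard normal by default).
print(stats.norm.pdf(0.0))  # density at 0, about 0.3989
print(stats.norm.cdf(0.0))  # probability of a value <= 0, exactly 0.5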

Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, evaluate that cumulative distribution at the sup and inf values, and subtract to find the probability that a value drawn from the unknown distribution falls between them.

There are lots of ways to estimate the unknown distribution from the data, depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of a value falling between those bounds.
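
Putting that together, a minimal sketch (the data array is synthetic and the bounds are made up; substitute your own areas and inf/sup values):

import numpy as np
from scipy import stats

# Synthetic stand-in for the 16383 area values.
data = np.random.lognormal(mean=1.0, sigma=0.5, size=16383)

# Fit a lognormal with the location fixed at 0, as suggested above.
params = stats.lognorm.fit(data, floc=0)

inf, sup = 1.0, 5.0  # made-up bounds
prob = stats.lognorm.cdf(sup, *params) - stats.lognorm.cdf(inf, *params)
print(prob)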

In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
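
A sketch of the KDE route, reusing the hypothetical data, inf, and sup from the previous example:

from scipy import stats

# Build a Gaussian kernel density estimate of the unknown distribution
# and integrate it between the two bounds.
kde = stats.gaussian_kde(data)
prob = kde.integrate_box_1d(inf, sup)
print(prob)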

The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.
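
And the simple empirical count, again with the same hypothetical names:

import numpy as np

# Fraction of data points falling between inf and sup.
prob = np.mean((data >= inf) & (data <= sup))
print(prob)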
