First of all, the easiest way to optimize a function using scipy.optimize is to construct the target function so that its first argument is the list of parameters to be optimized, and the following arguments specify other things, such as data and fixed parameters. Second, it is very helpful to use the vectorization provided by numpy. Therefore we have these:
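As a minimal sketch of that calling convention (the sq_loss objective and the toy data here are made up for illustration): the first argument bundles the free parameters, and everything else rides along via args:

```python
import numpy as np
import scipy.optimize as so

# Hypothetical toy objective: p holds the free parameters,
# data is passed through unchanged via args.
def sq_loss(p, data):
    return ((data - p[0]) ** 2).sum()

data = np.array([1.0, 2.0, 3.0, 4.0])
est = so.fmin(sq_loss, [0.0], args=(data,), disp=False)
# est[0] converges to the sample mean, 2.5
```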
In [61]:
# modified pdf and cdf
import numpy as np

def gumbel_pdf(x, loc, scale):
    """Returns the value of Gumbel's pdf with parameters loc and scale at x.
    """
    # substitute
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    """Returns the value of Gumbel's cdf with parameters loc and scale at x.
    """
    return np.exp(-np.exp(-(x - loc) / scale))
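As a quick sanity check (not part of the original answer), these hand-rolled functions can be compared against scipy.stats.gumbel_r, which implements the same right-skewed Gumbel distribution; the two functions are restated so the snippet runs on its own:

```python
import numpy as np
from scipy.stats import gumbel_r

def gumbel_pdf(x, loc, scale):
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    return np.exp(-np.exp(-(x - loc) / scale))

x = np.linspace(-2, 5, 50)
# both should agree with scipy's reference implementation
assert np.allclose(gumbel_pdf(x, 1.0, 2.0), gumbel_r.pdf(x, loc=1.0, scale=2.0))
assert np.allclose(gumbel_cdf(x, 1.0, 2.0), gumbel_r.cdf(x, loc=1.0, scale=2.0))
```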
In [62]:
def trunc_GBL(p, x):
    threshold = p[0]
    loc = p[1]
    scale = p[2]
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    #print x1, nx2, L1, L2
    return L1 + L2
In [63]:
import scipy.optimize as so
In [64]:
#first we make a simple Gumbel fit
so.fmin(lambda p, x: (-np.log(gumbel_pdf(x, p[0], p[1]))).sum(), [0.5,0.5], args=(np.array(non_truncated_data),))
Optimization terminated successfully.
Current function value: 35.401255
Iterations: 70
Function evaluations: 133
Out[64]:
array([ 16.47028986, 0.72449091])
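For the untruncated case, a cross-check (my addition, not in the original answer) is scipy.stats.gumbel_r.fit, which returns the maximum-likelihood loc and scale directly; the synthetic data below stands in for non_truncated_data, which I don't have:

```python
import numpy as np
from scipy.stats import gumbel_r

np.random.seed(0)
data = np.random.gumbel(loc=16.5, scale=0.7, size=500)
loc, scale = gumbel_r.fit(data)
# the estimates land close to the true parameters (16.5, 0.7)
```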
In [65]:
#then we use the result as starting value for your truncated Gumbel fit
so.fmin(trunc_GBL, [17, 16.47028986, 0.72449091], args=(np.array(non_truncated_data),))
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 25
Function evaluations: 94
Out[65]:
array([ 13.41111111, 16.65329308, 0.79694 ])
Here in the trunc_GBL function I substituted your pdf with a scaled pdf. See the rationale here; basically it is because your L1 is pdf-based and L2 is cdf-based: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm

Then we notice a problem, see Current function value: 0.000000 in the last output: the negative log-likelihood function is 0. This is because:
In [66]:
gumbel_cdf(13.41111111, 16.47028986, 0.72449091)
Out[66]:
2.3923515777163676e-30
Effectively 0. This means that, based on the model you just described, the maximum is always reached when the threshold value is low enough that L1 does not exist (x < threshold is empty) and L2 is 0 (1-F(C) is 1 for every item in your data, so -log(1-F(C)) is 0).

For this reason, your model does not look all that right to me. You may want to rethink it.
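To see that degenerate minimum concretely, here is a self-contained rerun on synthetic data (the parameters 16.5 and 0.7 are made up to resemble the fits above): with a threshold below every observation, x1 is empty so L1 is an empty sum (0), and gumbel_cdf(threshold) underflows to 0 so L2 = -log(1) * nx2 = 0:

```python
import numpy as np

def gumbel_pdf(x, loc, scale):
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    return np.exp(-np.exp(-(x - loc) / scale))

def trunc_GBL(p, x):
    threshold, loc, scale = p
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    return L1 + L2

np.random.seed(0)
x = np.random.gumbel(loc=16.5, scale=0.7, size=100)
val = trunc_GBL([5.0, 16.5, 0.7], x)  # threshold far below all data
# val is 0: the "best" fit censors nothing and observes nothing
```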
Edit
We can further isolate threshold
and treat it as a fixed parameter:
def trunc_GBL(p, x, threshold):
    loc = p[0]
    scale = p[1]
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    #print x1, nx2, L1, L2
    return L1 + L2
And call the optimizer differently:
so.fmin(trunc_GBL, [0.5, 0.5], args=(X, np.percentile(X, 20)))
Optimization terminated successfully.
Current function value: 20.412818
Iterations: 72
Function evaluations: 136
Out[9]:
array([ 16.34594943, 0.45253201])
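An equivalent way to pin threshold, if you prefer not to thread it through args, is functools.partial; the synthetic X and the starting point below are made up, since I don't have your data:

```python
import functools
import numpy as np
import scipy.optimize as so

def gumbel_pdf(x, loc, scale):
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    return np.exp(-np.exp(-(x - loc) / scale))

def trunc_GBL(p, x, threshold):
    loc, scale = p
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    return L1 + L2

np.random.seed(0)
X = np.random.gumbel(loc=16.5, scale=0.7, size=200)

# freeze the data and the fixed threshold into the objective;
# fmin then only sees the two free parameters
obj = functools.partial(trunc_GBL, x=X, threshold=np.percentile(X, 20))
loc, scale = so.fmin(obj, [16.5, 0.7], disp=False)
```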
This way if you want the 70% quantile, you can simply change it to np.percentile(X, 70) and so on. np.percentile() is just another way of doing .quantile(); it takes a percentage in [0, 100] instead of a fraction in [0, 1], so np.percentile(X, 80) matches .quantile(0.8).
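For instance, on toy data, numpy's own quantile function shows the correspondence between the two scales (pandas' .quantile uses the same linear interpolation by default):

```python
import numpy as np

x = np.arange(1, 101, dtype=float)
# np.percentile takes a percentage in [0, 100],
# quantile-style APIs take a fraction in [0, 1]
assert np.percentile(x, 80) == np.quantile(x, 0.8)  # both ~80.2
```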