First of all, the easiest way to optimize a function using scipy.optimize is to construct the target function so that its first argument is the list of parameters to be optimized, and the following arguments specify other things, such as data and fixed parameters. Second, it is very helpful to use the vectorization provided by numpy. Therefore we have these:
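As a minimal sketch of that calling convention (the sq_loss objective and the toy data here are made up for illustration): the first argument bundles the free parameters, and everything else rides along via args:

```python
import numpy as np
import scipy.optimize as so

# Hypothetical toy objective: p holds the free parameters,
# data is passed through unchanged via args.
def sq_loss(p, data):
    return ((data - p[0]) ** 2).sum()

data = np.array([1.0, 2.0, 3.0, 4.0])
est = so.fmin(sq_loss, [0.0], args=(data,), disp=False)
# est[0] converges to the sample mean, 2.5
```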
In [61]:
# modified pdf and cdf
import numpy as np

def gumbel_pdf(x, loc, scale):
    """Returns the value of Gumbel's pdf with parameters loc and scale at x.
    """
    # substitute
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    """Returns the value of Gumbel's cdf with parameters loc and scale at x.
    """
    return np.exp(-np.exp(-(x - loc) / scale))
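As a quick sanity check (not part of the original answer), these hand-rolled functions can be compared against scipy.stats.gumbel_r, which implements the same right-skewed Gumbel distribution; the two functions are restated so the snippet runs on its own:

```python
import numpy as np
from scipy.stats import gumbel_r

def gumbel_pdf(x, loc, scale):
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    return np.exp(-np.exp(-(x - loc) / scale))

x = np.linspace(-2, 5, 50)
# both should agree with scipy's reference implementation
assert np.allclose(gumbel_pdf(x, 1.0, 2.0), gumbel_r.pdf(x, loc=1.0, scale=2.0))
assert np.allclose(gumbel_cdf(x, 1.0, 2.0), gumbel_r.cdf(x, loc=1.0, scale=2.0))
```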
In [62]:
def trunc_GBL(p, x):
    threshold = p[0]
    loc = p[1]
    scale = p[2]
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    #print x1, nx2, L1, L2
    return L1 + L2
In [63]:
import scipy.optimize as so
In [64]:
#first we make a simple Gumbel fit
so.fmin(lambda p, x: (-np.log(gumbel_pdf(x, p[0], p[1]))).sum(), [0.5,0.5], args=(np.array(non_truncated_data),))
Optimization terminated successfully.
Current function value: 35.401255
Iterations: 70
Function evaluations: 133
Out[64]:
array([ 16.47028986, 0.72449091])
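For the untruncated case, a cross-check (my addition, not in the original answer) is scipy.stats.gumbel_r.fit, which returns the maximum-likelihood loc and scale directly; the synthetic data below stands in for non_truncated_data, which I don't have:

```python
import numpy as np
from scipy.stats import gumbel_r

np.random.seed(0)
data = np.random.gumbel(loc=16.5, scale=0.7, size=500)
loc, scale = gumbel_r.fit(data)
# the estimates land close to the true parameters (16.5, 0.7)
```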
In [65]:
#then we use the result as starting value for your truncated Gumbel fit
so.fmin(trunc_GBL, [17, 16.47028986, 0.72449091], args=(np.array(non_truncated_data),))
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 25
Function evaluations: 94
Out[65]:
array([ 13.41111111, 16.65329308, 0.79694 ])
Here in the trunc_GBL function I substituted your pdf with a scaled pdf. See the rationale here; basically it is because your L1 is pdf-based and L2 is cdf-based: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm

Then we notice a problem, see Current function value: 0.000000 in the last output: the negative log-likelihood function is 0. This is because:
In [66]:
gumbel_cdf(13.41111111, 16.47028986, 0.72449091)
Out[66]:
2.3923515777163676e-30
Effectively 0. This means that, based on the model you just described, the maximum is always reached when the threshold value is low enough that L1 does not exist (x < threshold is empty) and L2 is 0 (1-F(C) is 1 for every item in your data, so -log(1-F(C)) is 0).

For this reason, your model does not look all that right to me. You may want to rethink it.
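To see that degenerate minimum concretely, here is a self-contained rerun on synthetic data (the parameters 16.5 and 0.7 are made up to resemble the fits above): with a threshold below every observation, x1 is empty so L1 is an empty sum (0), and gumbel_cdf(threshold) underflows to 0 so L2 = -log(1) * nx2 = 0:

```python
import numpy as np

def gumbel_pdf(x, loc, scale):
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    return np.exp(-np.exp(-(x - loc) / scale))

def trunc_GBL(p, x):
    threshold, loc, scale = p
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    return L1 + L2

np.random.seed(0)
x = np.random.gumbel(loc=16.5, scale=0.7, size=100)
val = trunc_GBL([5.0, 16.5, 0.7], x)  # threshold far below all data
# val is 0: the "best" fit censors nothing and observes nothing
```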
Edit
We can further isolate threshold
and treat it as a fixed parameter:
def trunc_GBL(p, x, threshold):
    loc = p[0]
    scale = p[1]
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    #print x1, nx2, L1, L2
    return L1 + L2
And call the optimizer differently:
so.fmin(trunc_GBL, [0.5, 0.5], args=(X, np.percentile(X, 20)))
Optimization terminated successfully.
Current function value: 20.412818
Iterations: 72
Function evaluations: 136
Out[9]:
array([ 16.34594943, 0.45253201])
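An equivalent way to pin threshold, if you prefer not to thread it through args, is functools.partial; the synthetic X and the starting point below are made up, since I don't have your data:

```python
import functools
import numpy as np
import scipy.optimize as so

def gumbel_pdf(x, loc, scale):
    z = (x - loc) / scale
    return (1. / scale) * np.exp(-(z + np.exp(-z)))

def gumbel_cdf(x, loc, scale):
    return np.exp(-np.exp(-(x - loc) / scale))

def trunc_GBL(p, x, threshold):
    loc, scale = p
    x1 = x[x < threshold]
    nx2 = len(x[x >= threshold])
    L1 = (-np.log(gumbel_pdf(x1, loc, scale) / scale)).sum()
    L2 = (-np.log(1 - gumbel_cdf(threshold, loc, scale))) * nx2
    return L1 + L2

np.random.seed(0)
X = np.random.gumbel(loc=16.5, scale=0.7, size=200)

# freeze the data and the fixed threshold into the objective;
# fmin then only sees the two free parameters
obj = functools.partial(trunc_GBL, x=X, threshold=np.percentile(X, 20))
loc, scale = so.fmin(obj, [16.5, 0.7], disp=False)
```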
This way if you want the 70% quantile, you can simply change it to np.percentile(X, 70) and so on. np.percentile() is just another way of doing .quantile(); it takes a percentage in [0, 100] instead of a fraction in [0, 1], so np.percentile(X, 80) matches .quantile(0.8).
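For instance, on toy data, numpy's own quantile function shows the correspondence between the two scales (pandas' .quantile uses the same linear interpolation by default):

```python
import numpy as np

x = np.arange(1, 101, dtype=float)
# np.percentile takes a percentage in [0, 100],
# quantile-style APIs take a fraction in [0, 1]
assert np.percentile(x, 80) == np.quantile(x, 0.8)  # both ~80.2
```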