What's the equivalent of Ruby's pnormaldist statistics function in Haskell?

https://stackoverflow.com/questions/6116770

29-12-2020
|

문제

As seen here: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

Here's the Ruby code itself, implemented in the Statistics2 library:

# inverse of normal distribution ([2])
# Pr( (-\infty, x] ) = qn -> x
def pnormaldist(qn)
  b = [1.570796288, 0.03706987906, -0.8364353589e-3,
       -0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
       -0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
       0.3657763036e-10, 0.6936233982e-12]

  if(qn < 0.0 || 1.0 < qn)
    $stderr.printf("Error : qn <= 0 or qn >= 1  in pnorm()!\n")
    return 0.0;
  end
  qn == 0.5 and return 0.0

  w1 = qn
  qn > 0.5 and w1 = 1.0 - w1
  w3 = -Math.log(4.0 * w1 * (1.0 - w1))
  w1 = b[0]
  1.upto 10 do |i|
    w1 += b[i] * w3**i;
  end
  qn > 0.5 and return Math.sqrt(w1 * w3)
  -Math.sqrt(w1 * w3)
end

해결책

This is pretty straightforward to translate:

module PNormalDist where

pnormaldist :: (Ord a, Floating a) => a -> Either String a
pnormaldist qn
  | qn < 0 || 1 < qn = Left "Error: qn must be in [0,1]"
  | qn == 0.5        = Right 0.0
  | otherwise        = Right $
      let w3 = negate . log $ 4 * qn * (1 - qn)
          b = [ 1.570796288, 0.03706987906, -0.8364353589e-3, 
                -0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5, 
                -0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8, 
                0.3657763036e-10, 0.6936233982e-12]
          w1 = sum . zipWith (*) b $ iterate (*w3) 1
      in (signum $ qn - 0.5) * sqrt (w1 * w3)

First off, let's look at the ruby - it returns a value, but sometimes it prints an error message (when given an improper argument). This isn't very haskellish, so let's have our return value be Either String a - where we'll return a Left String with an error message if given an improper argument, and a Right a otherwise.

Now we check the two cases at the top:

qn < 0 || 1 < qn = Left "Error: qn must be in [0,1]" - this is the error condition, when qn is out of range.
qn == 0.5 = Right 0.0 - this is the ruby check qn == 0.5 and return * 0.0

Next up, we define w1 in the ruby code. But we redefine it a few lines later, which isn't very rubyish. The value that we store in w1 the first time is used immediately in the definition of w3, so why don't we skip storing it in w1? We don't even need to do the qn > 0.5 and w1 = 1.0 - w1 step, because we use the product w1 * (1.0 - w1) in the definition of w3.

So we skip all that, and move straight to the definition w3 = negate . log $ 4 * qn * (1 - qn).

Next up is the definition of b, which is a straight lift from the ruby code (ruby's syntax for an array literal is haskell's syntax for a list).

Here's the most tricky bit - defining the ultimate value of w3. What the ruby code does in

w1 = b[0]
1.upto 10 do |i|
  w1 += b[i] * w3**i;
end

Is what's called a fold - reducing a set of values (stored in a ruby array) into a single value. We can restate this more functionally (but still in ruby) using Array#reduce:

w1 = b.zip(0..10).reduce(0) do |accum, (bval,i)|
  accum + bval * w3^i
end

Note how I pushed b[0] into the loop, using the identity b[0] == b[0] * w3^0.

Now we could port this directly to haskell, but it's a bit ugly

w1 = foldl 0 (\accum (bval,i) -> accum + bval * w3**i) $ zip b [0..10]

Instead, I broke it up into several steps - first off, we don't really need i, we just need the powers of w3 (starting at w3^0 == 1), so let's calculate those with iterate (*w3) 1.

Then, rather than zipping those into pairs with the elements of b, we ultimately just need their products, so we can zip them into the products of each pair using zipWith (*) b.

Now our folding function is really easy - we just need to sum up the products, which we can do using sum.

Lastly, we decide whether to return plus or minus sqrt (w1 * w3), according to whether qn is greater or less than 0.5 (we already know it's not equal). So rather than calculating the square root in two separate locations as in the ruby code, I calculated it once, and multiplied it by +1 or -1 according to the sign of qn - 0.5 (signum just returns the sign of a value).

다른 팁

Digging around on Hackage, there's a number of libraries for statistics:

hmatrix-gsl-stats -- a pure binding to GSL
hstatistics -- an even higher level interface to GSL
hstats -- common statistical methods
statistics -- more common statistical methods
statistics-linreg -- a linear regression between two samples, based on the other statistics package.

You want a version of pnormaldist, which "Returns the P-value of normaldist(x)".

Statistics.Distribution.Normal, from the statistics package, provides many functions for manipulating normal distributions.
Statistics.Test.NonParametric contains a number of things to do with P-values.

Perhaps something there provides what you need?

The function you want is now available in the erf package on hackage. It's called invnormcdf.

here is my Wilson's score confidence interval for a Bernoulli parameter in node.js

wilson.normaldist = function(qn) {
    var b = [1.570796288, 0.03706987906, -0.0008364353589, -0.0002250947176, 0.000006841218299, 0.000005824238515, -0.00000104527497, 0.00000008360937017, -0.000000003231081277,
        0.00000000003657763036, 0.0000000000006936233982
    ];
    if (qn < 0.0 || 1.0 < qn) return 0;
    if (qn == 0.5) return 0;
    var w1 = qn;
    if (qn > 0.5) w1 = 1.0 - w1;
    var w3 = -Math.log(4.0 * w1 * (1.0 - w1));
    w1 = b[0];

    function loop(i) {
        w1 += b[i] * Math.pow(w3, i);
        if (i < b.length - 1) loop(++i);
    };
    loop(1);
    if (qn > 0.5) return Math.sqrt(w1 * w3);
    else return -Math.sqrt(w1 * w3);
}

wilson.rank = function(up_votes, down_votes) {
    var confidence = 0.95;
    var pos = up_votes;
    var n = up_votes + down_votes;
    if (n == 0) return 0;
    var z = this.normaldist(1 - (1 - confidence) / 2);
    var phat = 1.0 * pos / n;
    return ((phat + z * z / (2 * n) - z * Math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)) * 10000;
}

A brief look on hackage didn't reveal anything, so I suggest you translate the ruby code to Haskell. It's simple enough.

The Ruby code is undocumented; there is no specification of what this function is supposed to do. How does anyone know whether it does correctly whatever is intended?

I wouldn't just blindly copy and paste this arithmetic from one implementation to another (like the author of the Ruby package did).

A citation is given as ([2]) in a comment, but this is dangling. We find it in the comment block of the native C code in the _statistics2.c file.

/*
   statistics2.c
   distributions of statistics2
   by Shin-ichiro HARA
   2003.09.25
   Ref:
   [1] http://www.matsusaka-u.ac.jp/~okumura/algo/
   [2] http://www5.airnet.ne.jp/tomy/cpro/sslib11.htm
*/

Very sloppy work to only cite the C source code from where the coefficients were cribbed, rather than the original source of the formula.

The [1] link doesn't work any more; server not found. Luckily, the one we want is [2]. This is a page in Japanese with some C code for various functions. References are given. The one we want is pnorm. In the table, the algorithm is attributed to 戸田の近似式 which means "Toda's Approximation".

Toda is a common surname in Japan; more detective work is required to find out who this is.

After much effort, here we go: paper (Japanese): The Minimax Approximation for Percentage Points of the Standard Normal Distribution (1993) by Hideo Toda and Harumi Ono.

The algorithm is attributed to Toda (I'm assuming the same one that is the paper's co-author), dated to 1967 on P. 19.

It seems fairly obscure; the likely rationale for using it in the Ruby package is that it was found in source code of domestic origin citing the name of a domestic academic.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow