How to compare different distribution means with reference truth value in Matlab?

https://stackoverflow.com/questions/3732096

03-10-2019

Question

I have production (q) values from 4 different methods stored in 4 matrices. Each matrix contains the q values from a different method:

Matrix_1 = 1 row x 20 columns

Matrix_2 = 100 rows x 20 columns 

Matrix_3 = 100 rows x 20 columns 

Matrix_4 = 100 rows x 20 columns

The number of columns indicates the number of years. One row contains the production values for the 20 years. The other 99 rows in matrices 2, 3 and 4 are just different realizations (or simulation runs). So basically the other 99 rows in matrices 2, 3 and 4 are repeat cases (but not with exactly the same values, because of random numbers).

Consider Matrix_1 as the reference truth (or base case). Now I want to compare the other 3 matrices with Matrix_1 to see which of those three matrices (each with 100 repeats) most closely matches Matrix_1.

How can this be done in Matlab?

I know that, done manually, we would use confidence intervals (CI): plot the mean of Matrix_1 and draw the distribution of the means of Matrix_2, Matrix_3 and Matrix_4. The largest CI among matrices 2, 3 and 4 that contains the reference truth (the mean of Matrix_1) would be the answer.

mean of Matrix_1 = (1 row x 1 column)

mean of Matrix_2 = (100 rows x 1 column)

mean of Matrix_3 = (100 rows x 1 column)

mean of Matrix_4 = (100 rows x 1 column)

I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!

EDIT: The three methods I mentioned are a1, a2 and a3 respectively. Here are my results:

ci_a1 =

  1.0e+008 *

   4.084733001497999
   4.097677503988565

ci_a2 =

  1.0e+008 *

   5.424396063219890
   5.586301025525149

ci_a3 =

  1.0e+008 *

   2.429145282593182
   2.838897116739112

p_a1 =

    8.094614835195452e-130

p_a2 =

    2.824626709966993e-072

p_a3 =

    3.054667629953656e-012

h_a1 = 1; h_a2 = 1;  h_a3 = 1

None of the CIs from the three methods contains the reference mean (= 3.454992884900722e+008). So do we still use the p-value to choose the best result?

Solution

If I understand correctly, the calculation in MATLAB is pretty straightforward.

Steps 1-2 (mean calculation):

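% Average along dimension 2 (the 20 years) so that each realization gives one mean.
k1_mean = mean(k1, 2);   % 1 x 1 reference mean
k2_mean = mean(k2, 2);   % 100 x 1, one mean per realization
k3_mean = mean(k3, 2);   % 100 x 1
k4_mean = mean(k4, 2);   % 100 x 1

Step 3, use HIST to plot distribution histograms:

hist([k2_mean, k3_mean, k4_mean])   % 100 x 3 input: one histogram series per method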

Step 4. You can run a one-sample t-test on each of your vectors 2, 3 and 4. The null hypothesis is that the values come from a normal distribution with mean k1_mean and unknown variance. See TTEST for details.

[h,p,ci] = ttest(k2_mean,k1_mean);
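The same test can be repeated for all three methods in a loop. The following is only a sketch: the matrices are filled with random placeholder data (an assumption, replace them with your own), the names `sims`, `names` and `ref` are made up for illustration, and `ttest` requires the Statistics Toolbox.

```matlab
% Placeholder data (assumption): replace with your own matrices.
k1 = 100 + randn(1, 20);        % reference truth, 1 x 20
k2 = 100 + randn(100, 20);      % method 1, 100 realizations x 20 years
k3 = 103 + randn(100, 20);      % method 2
k4 =  95 + 5 * randn(100, 20);  % method 3

ref   = mean(k1, 2);            % scalar reference mean
sims  = {k2, k3, k4};
names = {'Matrix_2', 'Matrix_3', 'Matrix_4'};

for i = 1:numel(sims)
    m = mean(sims{i}, 2);       % 100 x 1 vector of per-realization means
    [h, p, ci] = ttest(m, ref); % one-sample t-test against the reference mean
    fprintf('%s: h = %d, p = %.3g, CI = [%.6g, %.6g]\n', ...
            names{i}, h, p, ci(1), ci(2));
end
```

Here `h = 1` means the null hypothesis (population mean equal to `ref`) is rejected at the default 5% level, and `ci` is the 95% confidence interval for the mean of that method.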

OTHER TIPS

EDIT: I misinterpreted your question. See Yuk's answer and the comments that follow. My answer is what you need if you want to compare the distributions of two vectors rather than a vector against a single value. Apparently the latter is the case here.

Regarding your t-tests, keep in mind that they test against a "true" mean. Given the number of values in each matrix and the confidence intervals, it's not too difficult to work out the standard deviation of your results. This is a measure of the "spread" of your results. The standard error of your mean is the standard deviation of your results divided by the square root of the number of observations. The confidence interval is the sample mean plus or minus approximately 2 times that standard error.
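As a rough sketch of that arithmetic, using the usual definition of the standard error (standard deviation divided by the square root of the number of observations): the first part computes an approximate 95% interval from a placeholder sample, and the second part works backwards from the `ci_a1` values posted in the question. The sample `x` is random placeholder data, and `n_a1 = 100` is an assumption (one mean per realization).

```matlab
% Forward: standard error and approximate 95% CI from a sample.
x  = 100 + 5 * randn(100, 1);        % placeholder sample (assumption)
n  = numel(x);
se = std(x) / sqrt(n);               % standard error of the mean
ci_approx = mean(x) + [-2, 2] * se;  % mean +/- about 2 standard errors

% Backward: estimate the standard deviation from a reported CI.
ci_a1 = 1.0e8 * [4.084733001497999; 4.097677503988565];  % from the question
n_a1  = 100;                         % assumption: 100 per-realization means
half_width = (ci_a1(2) - ci_a1(1)) / 2;
se_a1 = half_width / 2;              % undo the factor of about 2
sd_a1 = se_a1 * sqrt(n_a1);          % undo the division by sqrt(n)
fprintf('Estimated standard deviation for a1: %.4g\n', sd_a1);
```

The factor 2 is an approximation. `ttest` uses the exact t quantile, which is about 1.98 for 99 degrees of freedom, so the recovered standard deviation is only a ballpark figure.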

This confidence interval contains the true mean in 95% of the cases. So if the true mean lies exactly on the border of that interval, the p-value is 0.05; the further away the true mean, the lower the p-value. The p-value is the probability of seeing sample means at least this far from the reference if the values in matrix 2, 3 or 4 really came from a population with the mean of matrix 1. Looking at your p-values, that possibility can be ruled out for all practical purposes.

So you see that when the number of values gets high, the confidence interval becomes narrower and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com
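A minimal sketch of the "closest mean" rule, using only the numbers posted in the question. It relies on the interval returned by `ttest` being symmetric about the sample mean, so the midpoint of each CI recovers that mean. The variable names are illustrative.

```matlab
ref = 3.454992884900722e8;   % reference mean from the question

% Columns: a1, a2, a3. Rows: lower and upper CI bound (from the question).
ci = 1.0e8 * [4.084733001497999, 5.424396063219890, 2.429145282593182;
              4.097677503988565, 5.586301025525149, 2.838897116739112];

sample_means = mean(ci, 1);             % CI midpoints = sample means
dist = abs(sample_means - ref);         % absolute distance to the reference
[~, best] = min(dist);

names = {'a1', 'a2', 'a3'};
fprintf('Closest mean: %s (distance %.4g)\n', names{best}, dist(best));
```

With these numbers the rule picks a1. Note that it ignores the spread of each method, which is why looking at the distributions is still worthwhile.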

Your question and your method aren't really clear:

Is the distribution equal in all columns? This is important, as two distributions can have the same mean but still differ significantly, for example in spread or shape.

Is there a reason why you don't use the Central Limit Theorem? This seems to me a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution where sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work, provided the distributions are alike!
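The square-root scaling of the standard deviation of a mean can be checked with a small simulation. This is only an illustration on synthetic data: the observations are drawn from an exponential distribution with mean 1 and standard deviation 1 (an arbitrary, deliberately non-normal choice), and `n` and `reps` are made-up values.

```matlab
n    = 20;                       % observations per mean (e.g. 20 years)
reps = 10000;                    % number of simulated means

obs   = -log(rand(reps, n));     % exponential(1) samples: mean 1, sd 1
means = mean(obs, 2);            % one mean per row

sd_empirical = std(means);       % spread of the simulated means
sd_predicted = 1 / sqrt(n);      % sd(observations) / sqrt(n)

fprintf('empirical: %.4f, predicted: %.4f\n', sd_empirical, sd_predicted);
hist(means, 50)                  % shape is already close to a normal curve
```

The two printed values should agree closely, and the histogram of the means looks roughly bell-shaped even though the individual observations are strongly skewed.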

Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a 2-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
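In MATLAB both are available in the Statistics Toolbox as `qqplot` and `kstest2`. A short sketch, with two random placeholder vectors `a` and `b` standing in for the samples you want to compare (an assumption, not your data):

```matlab
% Placeholder samples (assumption): replace with the two vectors to compare.
a = randn(100, 1);               % standard normal
b = 0.5 + 2 * randn(100, 1);     % shifted and wider

qqplot(a, b);                    % quantiles of a against quantiles of b

% Two-sample Kolmogorov-Smirnov test.
% Null hypothesis: a and b come from the same continuous distribution.
[h, p, ksstat] = kstest2(a, b);
fprintf('h = %d, p = %.3g, KS statistic = %.3f\n', h, p, ksstat);
```

If the points in the quantile-quantile plot fall close to a straight line, the two samples have a similar distribution shape. `ksstat` is the largest vertical distance between the two empirical cumulative distribution functions.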

On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use an appropriate correction, e.g. Bonferroni or Dunn-Sidak.
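Both corrections simply lower the per-test significance threshold. A sketch using the three p-values posted in the question, with a family-wise level of 0.05 as an assumed choice:

```matlab
p = [8.094614835195452e-130, 2.824626709966993e-072, 3.054667629953656e-012];

alpha = 0.05;                             % family-wise level (assumption)
m     = numel(p);                         % number of comparisons

alpha_bonf  = alpha / m;                  % Bonferroni threshold
alpha_sidak = 1 - (1 - alpha)^(1 / m);    % Dunn-Sidak threshold

reject_bonf  = p < alpha_bonf;            % logical vector, one entry per test
reject_sidak = p < alpha_sidak;

fprintf('Bonferroni threshold: %.5f, Dunn-Sidak threshold: %.5f\n', ...
        alpha_bonf, alpha_sidak);
disp([reject_bonf; reject_sidak]);
```

The Dunn-Sidak threshold is always slightly larger than the Bonferroni one, so it is a little less conservative. With p-values as small as these, the correction does not change any decision.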

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow