Weighted mean of similar elements in a vector in R

https://stackoverflow.com/questions/10929732

13-06-2021
|

Question

I have two vectors x and w. w is a numerical vector of weights the same length as x giving the weights to use for elements of x.

I would like to give weighted mean of the elements in vector x which their difference are small(for example 1e-1 or 1e-2) to decrease the length of vector x. for example, these vectors are as follows:

    w =c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04,
         2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04,
         4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03,
         1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02,
         1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03,
         3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01,
         3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04,
         7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04,
         8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04,
         5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07,
         3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07,
         2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07,
         2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07,
         2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03)

     x =c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900,
          0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500,
          1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027,
          2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953,
          2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953,
          3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358,
          3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813,
          0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596,
          2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608,
          3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866)

I know how to sort the vector x according to their weights but how can I recognize the similar values in vector x and then getting their weighted mean?

Solution

UPDATED ANSWER

How about something like this...? (See code below)

I called your original vectors origx and origw, so that the reordered ones are x and w. The code works on temporary copies of x and w (called xtemp and wtemp) which get destroyed, and builds up the new x and w (i.e. the "shorter" vectors you seek) in the variables xnew and wnew.

In simple terms, the code looks at xtemp and finds the first gap exceeding the threshold size (e.g. 0.05), and groups together all the elements from the start of xtemp running up to that "large" gap. (If there's no such gap it takes the whole of xtemp as a group.) The code then converts that group into a single weight called wgroup (the total of the group weights) and a single representative x value called xgroup (such that xgroup*wgroup is the same as the weighted sum of all the group elements). We then save xgroup and wgroup into the vectors xnew and wnew, wipe out the current group (by eliminating it from xtemp and wtemp), and then carry on in the same way till everything's been grouped.

Give it a test run and see what you think :)

origw = c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04,
          2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04,
          4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03,
          1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02,
          1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03,
          3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01,
          3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04,
          7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04,
          8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04,
          5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07,
          3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07,
          2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07,
          2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07,
          2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03)

origx = c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900,
          0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500,
          1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027,
          2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953,
          2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953,
          3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358,
          3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813,
          0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596,
          2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608,
          3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866)

reord = order(origx)
x = origx[reord]
w = origw[reord]

xnew = wnew = c()

thresh = 0.05
xtemp = x
wtemp = w
while (length(xtemp) > 0) {
nextgap = which(diff(xtemp) > thresh)[1]
if (!is.na(nextgap)) {
    group = seq_len(nextgap)
} else {
    group = seq_along(xtemp)
}
xgroup = sum((xtemp*wtemp)[group])/sum(wtemp[group])
wgroup = sum(wtemp[group])
xnew = c(xnew, xgroup)
wnew = c(wnew, wgroup)
xtemp = xtemp[-group]
wtemp = wtemp[-group]
}

OLD RESPONSE IS BELOW (superseded by the above...)

I'd suggest reordering x and w so that x is in strict numerical order, and then using the diff function:

reord = order(x)
x2 = x[reord]
w2 = w[reord]
which(diff(x2)<0.01)

The final command above indicates which elements in x2 (the sorted version of x) are within 0.01 of the next-highest element. The first value is 2 since elements 2 and 3 of x2 are such an example: x2[2]=0.4399527 and x2[3]=0.4429813.

Also, if you do

sort(diff(x2))

you can see all the differences arranged in numerical order, which might help you decide what a suitable cutoff should be.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow