R - Finding minimum values based on multiple conditions and returning one or multiple created strings based on the minimum value

https://stackoverflow.com/questions/23371517

r
tapply

12-07-2023

Question

I'm asking this as a follow-up to an earlier question of mine, which @alexis_laz answered in a pretty neat way. Unfortunately, his method (which involves creating a long data frame with loads of zeros) is too memory-intensive now that the original dataset has expanded dramatically.

The basic problem is this: consider a data frame with three columns x, y, z. For every y, I am looking for the lowest x-value and the z-value(s) associated with it. The ideal output would be a string of the form y[i]_x[i]_z[i], with i the relevant row number.

Here is a reproducible example:

set.seed(1)
x <- rpois(10000, lambda = 10); x[sample.int(50, 20)] <- NA
y <- rep(LETTERS, length.out=10000)
z <- 1:10000
df <- data.frame(x, y, z)   # avoid cbind() here: it would coerce every column to character

Desired output (which I found by simply ordering the df and scrolling):

df <- df[order(y,x,z),]

for y = A, min(x) = 1, with z = 313 => the desired result (NAs can be dropped) should be something like paste0(y,"_",x,"_",z), thus A_1_313
for y = B, min(x) = 2, with z = 782, 6008, or 7230 => the desired result would give me all three strings, thus B_2_782, B_2_6008 and B_2_7230
for y = F, min(x) = 3 and this minimum is linked to 5 different z-values (4114, 4712, 5336, 7234, 7520), so I'd like to get five strings ...

I don't expect more than 5 strings per group anywhere in the real data set. As said, @alexis_laz provided a solution to an almost identical problem (also asked by me), but that solution requires creating a data frame that exceeds what my computer can handle (>2.4GB data frame, 650 million rows) now that my dataset has grown from 37 to 15000 firms :)

Thanks in advance!

PS: I have looked for solutions using max.col and which.max in combination with tapply, but none have worked for me so far. It seems that something like tapply(x,y,which.min) simply returns a list of 1s in an ordered df, because which.min returns the position within each group's vector, and in a sorted df that position is always 1. Hence something that uses tapply but returns the row number in the df would be 99% of the job.
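The behaviour described in the PS can be reproduced on a small toy vector. The sketch below is not from the original post: the data and the names toy_x, toy_y, pos_in_group and row_of_min are made up for illustration. It also shows one possible workaround, assuming it is acceptable to hand tapply the row indices rather than the values, so that the within-group position can be mapped back to a row number.

```r
# Toy data, made up for illustration only (not the question's data)
toy_x <- c(5, 2, 9, 2, 7, 4)
toy_y <- c("A", "A", "A", "B", "B", "B")

# which.min only sees each group's subvector, so it reports a position
# inside that group, not a row number of the full data
pos_in_group <- tapply(toy_x, toy_y, which.min)

# Passing row indices instead lets the function translate back to row numbers
# (which.min keeps only the first minimum, so ties are not returned here)
row_of_min <- tapply(seq_along(toy_x), toy_y,
                     function(i) i[which.min(toy_x[i])])

print(pos_in_group)
print(row_of_min)
```

Because which.min returns only the first minimum, this variant would miss the tied rows the question asks for; comparing against min() inside the function instead would keep them.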

Answer

Edit: I got bitten by a subtle data.table behavior. data.table keeps the key on grouped results, but only for the columns you grouped by, so the join wasn't doing what I thought it was doing. Here is the exact same logic, but with one interim step to unset the partial key on the grouped data:

# data generated with `set.seed(1)`
library(data.table)
dt <- data.table(x, y, z)[!is.na(x)]
setkey(dt, y, x)                                   # among other things, this quickly sorts `dt` by `y`, then `x`
sub.dt <- dt[, list(x=x[[1]]), by=y][, list(y, x)] # get low X for each Y, and reorder cols to match key
setkey(sub.dt, NULL)                               # need to remove key as otherwise would join only on `y`
dt[sub.dt, paste(x, y, z, sep="_")]                # now join

Produces:

    y x       V1
 1: A 1  1_A_313
 2: B 2  2_B_782
 3: B 2 2_B_6008
 4: B 2 2_B_7230
 5: C 2 2_C_2993
 6: D 2 2_D_4762
 7: E 2  2_E_239
 8: E 2 2_E_4581
 9: F 3 3_F_4114
10: F 3 3_F_4712
...
41: S 2 2_S_3113
42: S 2 2_S_7949
43: T 2 2_T_4570
44: U 1  1_U_671
45: V 2  2_V_178
46: W 2 2_W_1817
47: W 2 2_W_2233
48: X 1  1_X_648
49: Y 2  2_Y_857
50: Y 2 2_Y_7227
51: Z 3 3_Z_6526
    y x       V1

Edit2: a cleaner version kindly contributed by Arun in the comments:

dt[dt[, .I[x==min(x)], by=y][, V1]]
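For readers without data.table, the same idea (keep every row whose x equals its group's minimum, ties included, then build the strings) can be sketched in base R. This is not part of the original answer: the toy data frame and the names toy, grp_min, idx and res are made up for illustration, and the sketch assumes every group has at least one non-NA x (otherwise min() would return Inf with a warning).

```r
# Toy data, made up for illustration (not the question's data)
toy <- data.frame(
  x = c(3, 1, NA, 1, 2, 2),
  y = c("A", "A", "A", "A", "B", "B"),
  z = 1:6
)

# Per-row copy of each group's minimum x, ignoring NAs
grp_min <- ave(toy$x, toy$y, FUN = function(v) min(v, na.rm = TRUE))

# Row numbers where x equals its group's minimum; which() drops the NA rows
idx <- which(toy$x == grp_min)

# One string per minimal row, in the y_x_z order the question asks for
res <- paste(toy$y[idx], toy$x[idx], toy$z[idx], sep = "_")
print(res)
```

On the toy data this keeps rows 2, 4, 5 and 6, so both tied rows of each group are returned. The group minimum is stored once per row, which costs one extra numeric vector of the same length as x rather than an expanded data frame.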

Licensed under: CC-BY-SA with attribution