With data.table, is SD[which.max(Var1)] the fastest way to find the max of a group? [duplicate]

StackOverflow https://stackoverflow.com/questions/23419683

  •  13-07-2023

Question

If needed I can put together a dataset, but my question is somewhat general.

accts <- accts[, .SD[which.max(EE)], by=DnB.Name]

I've got a data.table of about 350k rows, and some of the DnB.Name values (Dun & Bradstreet company name) are duplicated with differing employee counts (EE). I only care about the row with the highest EE for each name and can disregard the rest.

Anyway, DT is usually lightning quick, so I figure I must be doing something wrong?

Solution

Sort by EE in descending order, then take the first row for each group using a self join:

ordered <- accts[order(-EE)]   # copy of accts in descending order of EE
setkey(ordered, DnB.Name)      # must setkey before the join
ordered[J(unique(DnB.Name)), mult = "first"]
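
As a rough illustration of the sort-then-join approach, here is a minimal sketch on a made-up five-row table. The company names and counts are hypothetical; only the column names DnB.Name and EE come from the question. It relies on setkey using a stable sort, so the descending EE order is preserved within each name.

```r
library(data.table)

# Hypothetical toy data; only the column names follow the question.
accts <- data.table(
  DnB.Name = c("Acme", "Acme", "Globex", "Globex", "Initech"),
  EE       = c(10L, 250L, 40L, 5L, 7L)
)

ordered <- accts[order(-EE)]   # copy sorted by EE, largest first
setkey(ordered, DnB.Name)      # stable sort by name keeps the EE order within each name

# Join on the unique names and keep only the first match per name,
# which is the row with the largest EE.
res <- ordered[J(unique(DnB.Name)), mult = "first"]
print(res)
```

The result has one row per DnB.Name, and the original accts is left untouched because the sort works on a copy.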

For reference, check out this post on Cross Validated: https://stats.stackexchange.com/questions/7884/fast-ways-in-r-to-get-the-first-row-of-a-data-frame-grouped-by-an-identifier

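EDIT: Even faster, though the syntax is less obvious. The inner call uses .I to collect the row number of the maximum in each group, and the outer call subsets accts by those row numbers: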

accts[accts[, .I[which.max(EE)], by = DnB.Name]$V1]
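
To make the two steps visible, the sketch below splits that one-liner apart on hypothetical toy data (the values are invented; the column names follow the question) and checks that it returns the same rows as the .SD form from the question. The speed difference is probably because the .SD form subsets a data.table once per group, while the .I form only computes integers per group and subsets the table once; that explanation is an assumption here, not something measured.

```r
library(data.table)

# Hypothetical toy data; only the column names follow the question.
accts <- data.table(
  DnB.Name = c("Acme", "Acme", "Globex", "Globex", "Initech"),
  EE       = c(10L, 250L, 40L, 5L, 7L)
)

# Inner call: one row number per group, returned in a column named V1.
idx <- accts[, .I[which.max(EE)], by = DnB.Name]$V1

# Outer call: a single subset of the original table by those row numbers.
by_index <- accts[idx]

# The .SD form from the question, for comparison.
by_sd <- accts[, .SD[which.max(EE)], by = DnB.Name]

print(all.equal(by_index, by_sd))
```

Both forms return groups in order of first appearance, so the results can be compared directly. For a real timing comparison, wrapping each call in system.time() on the full 350k-row table would be more telling than this toy example.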

For reference, check this post with a similar question: Subset by group with data.table

Licensed under: CC-BY-SA with attribution