Calculation on subset of data without first saving subset as new data.frame

https://stackoverflow.com/questions/22800447

25-06-2023

Question

I have two data.frames and I am using them to create a new variable C (a standardized distance measure). Each data.frame has the following information: coordinates, season, and variables. I want to calculate C between df.a and df.b for every unique coordinate-season combination (i.e. each XX,YY - X,Y pair, by season). To this end I have merged the two data.frames (df.new) to prepare for calculating C.

Here is how I currently would perform this operation:

# for example, for season = SUM
# V1 and VV1 are the same variable from the different dataframes, SEA = Season, 
# X,Y and XX, YY are coordinates 
df.new.SUM <- subset(df.new, SEA == "SUM") # Summer
attach(df.new.SUM)
df.new.SUM$C_V1 <- (V1-VV1)^2/sd(V1)^2 # almost wouldn't need to subset except that the denominator here should only be for one season
df.new.SUM$C_V2 <- (V2-VV2)^2/sd(V2)^2
df.new.SUM$C <- sqrt(rowSums(df.new.SUM[,c("C_V1","C_V2")]))
# continue for other seasons and then rbind

However, this approach seems clunky. Is there a way to calculate C for each season-coordinate group without subsetting into a data.frame and then rbinding for each season? How can I use only one season without subsetting into a new data.frame? Or, even better, how do I do this for each season in a vectorized way? What R packages should I be exploring?

df.a <- structure(list(XX = c(10L, 10L, 11L, 11L, 12L, 12L, 13L, 13L, 
14L, 14L), YY = c(20L, 20L, 21L, 21L, 22L, 22L, 23L, 23L, 15L, 
15L), SEA = c("SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", 
"WIN", "SUM", "WIN"), VV1 = c(10.5, 15, 8, 8.5, 8, 7.5, 11, 13, 
15, 10), VV2 = c(13, 3, 3.5, 6, 3.5, 3, 5, 4, 5, 5)), .Names = c("XX", 
"YY", "SEA", "VV1", "VV2"), row.names = c(NA, -10L), class = "data.frame")
#
df.b <- structure(list(X = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Y = c(1L, 1L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L
), SEA = c("SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", "WIN", 
"SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", "WIN", "SUM", 
"WIN", "SUM", "WIN"), V1 = c(10, 12, 10, 9.5, 10, 14.5, 10.5, 
13, 11.5, 14, 12.5, 8.5, 10, 7.5, 11, 7, 11, 8, 11, 14.5), V2 = c(3.5, 
3, 3.5, 2.5, 3, 5, 5.5, 4, 2, 2.5, 3.5, 2, 3.5, 4.5, 5.5, 3.5, 
5, 6, 6, 5)), .Names = c("X", "Y", "SEA", "V1", "V2"), row.names = c(NA, 
-20L), class = "data.frame")
#
df.new <- merge(df.a, df.b, by = c("SEA"), all.x = TRUE, allow.cartesian=TRUE)
#
# EDIT ## solution based on suggestions below
df.out <- data.frame()
seasons <- unique(df.new$SEA)
for (s in seasons) {
  data <- subset(df.new, SEA == s)
  data$C <- sqrt(with(data, (V1 - VV1)^2 / sd(V1)^2 + (V2 - VV2)^2 / sd(V2)^2))
  df.out <- rbind(df.out, data)
}

Solution

Just wrap the steps together and please do not use attach in the future:

df.new.SUM$C <- sqrt( with(df.new.SUM, (V1-VV1)^2/sd(V1)^2 +(V2-VV2)^2/sd(V2)^2 ) )

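The with function is much safer than attach: it evaluates the expression inside the data.frame without adding anything to the search path, so column names cannot silently mask, or be masked by, other objects. BUT, maybe that wasn't what you wanted. Merging by SEA alone produces a cross-product, so the merged dataset has 50 rows with SEA=="SUM" (5 from df.a times 10 from df.b), and those "combinations" are not what your written description specified.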
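If the goal is the same calculation for every season without creating per-season subsets, one possible base R approach is ave(), which returns a group statistic aligned to each row. The sketch below is not part of the original answer: it rebuilds the question's data compactly, drops the allow.cartesian argument (assumed to be unnecessary for base merge on data.frames), and uses the helper names sd1 and sd2, which are made up for illustration. Like the loop in the question, it takes each season's sd over the merged rows.

```r
# Data from the question, rebuilt compactly with data.frame()
df.a <- data.frame(
  XX  = rep(10:14, each = 2),
  YY  = rep(c(20, 21, 22, 23, 15), each = 2),
  SEA = rep(c("SUM", "WIN"), times = 5),
  VV1 = c(10.5, 15, 8, 8.5, 8, 7.5, 11, 13, 15, 10),
  VV2 = c(13, 3, 3.5, 6, 3.5, 3, 5, 4, 5, 5),
  stringsAsFactors = FALSE
)
df.b <- data.frame(
  X   = rep(1:2, each = 10),
  Y   = rep(rep(1:5, each = 2), times = 2),
  SEA = rep(c("SUM", "WIN"), times = 10),
  V1  = c(10, 12, 10, 9.5, 10, 14.5, 10.5, 13, 11.5, 14,
          12.5, 8.5, 10, 7.5, 11, 7, 11, 8, 11, 14.5),
  V2  = c(3.5, 3, 3.5, 2.5, 3, 5, 5.5, 4, 2, 2.5,
          3.5, 2, 3.5, 4.5, 5.5, 3.5, 5, 6, 6, 5),
  stringsAsFactors = FALSE
)

# Merging by season only pairs every df.a row with every df.b row of that season
df.new <- merge(df.a, df.b, by = "SEA", all.x = TRUE)

# ave() returns a vector as long as its input; each element holds the
# statistic of the group (here: season) that the row belongs to
sd1 <- ave(df.new$V1, df.new$SEA, FUN = sd)  # hypothetical helper name
sd2 <- ave(df.new$V2, df.new$SEA, FUN = sd)  # hypothetical helper name

# One vectorized expression covers all seasons; no subset() or rbind() needed
df.new$C <- with(df.new, sqrt((V1 - VV1)^2 / sd1^2 + (V2 - VV2)^2 / sd2^2))

# Row counts per season show the cross-product: 5 x 10 = 50 rows each
table(df.new$SEA)
```

Whether the cross-product is the pairing you actually want is a separate question; this only removes the per-season subsetting step.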

Licensed under: CC-BY-SA with attribution
