Question

I'm using the following function, grp, to aggregate with data.table, and I'm running into a problem.

The problem is that the levels of the factor variable fc_x are not kept in the same order after aggregation. Is there a problem with my function, or is this "normal", meaning it has an explanation?

grp <- function(x) {
  percentage = as.numeric(table(x)/length(x))
  list(x = factor(levels(x)),
       percentage = percentage,
       label = paste0(round(percentage * 100), "%")
  )
}

library(data.table)

set.seed(123)
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
            labels = c("0-50", "51-100", "+100"))

str(DT)
# Classes ‘data.table’ and 'data.frame':  100 obs. of  3 variables:
# $ x   : num  90.7 59.4 18 125.4 187.7 ...
# $ fac : Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ fc_x: Factor w/ 3 levels "0-50","51-100",..: 2 2 1 3 3 3 3 3 1 1 ...

levels(DT$fc_x)
# [1] "0-50"   "51-100" "+100"

AGG <- DT[, grp(fc_x), by=fac]

levels(AGG$x)
# [1] "+100"   "0-50"   "51-100"

EDIT

Changing the label "+100" to "1000" gives a similar result:

DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "1000"))

levels(DT$fc_x)
# [1] "0-50"   "51-100" "1000"

AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50"   "1000"   "51-100"

Using ordered = TRUE in the cut() call gives the same result:

DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T, ordered = T,
               labels = c("0-50", "51-100", "1000"))

levels(DT$fc_x)
# [1] "0-50"   "51-100" "1000"

AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50"   "1000"   "51-100"

Solution

I think the issue is that when you define x inside your function, you call factor() on the level names without supplying the labels, so the factor levels are simply put in alphabetical order. I think you just need to add the labels to your function.

DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "+100"))

factor(levels(DT$fc_x))
# [1] 0-50   51-100 +100
# Levels: 0-50 +100 51-100

factor(levels(DT$fc_x), labels = c("0-50", "51-100", "+100"))
# [1] 0-50   +100   51-100
# Levels: 0-50 51-100 +100


grp <- function(x) {
  percentage = as.numeric(table(x)/length(x))
  list(
       x = factor(levels(x), labels = levels(x)),
       percentage = percentage,
       label = paste0(round(percentage * 100), "%")
  )
}

DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))

DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "+100"))
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50"   "51-100" "+100"
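
A note on the labels argument: labels renames the levels after factor() has sorted them, so a value can end up paired with a different label, whereas levels fixes the order without renaming anything. The following is a minimal base-R sketch of that difference, not the code from the answer above; the names lv, relabelled, reordered and grp_levels are hypothetical and chosen only for illustration.

```r
lv <- c("0-50", "51-100", "+100")   # desired level order

# `labels` renames the sorted levels position by position,
# so a value may carry a different label than it started with
relabelled <- factor(lv, labels = lv)

# `levels` sets the level order and leaves every value unchanged
reordered <- factor(lv, levels = lv)

levels(reordered)          # same order as lv
as.character(reordered)    # same values as lv
as.character(relabelled)   # can differ from lv, depending on the locale's sort order

# Hypothetical variant of grp that relies on `levels` instead of `labels`
grp_levels <- function(x) {
  share <- as.numeric(table(x) / length(x))   # table() follows levels(x) order
  list(x = factor(levels(x), levels = levels(x)),
       percentage = share,
       label = paste0(round(share * 100), "%"))
}
```

Assuming the DT from the question, this variant would be called in the same way, as DT[, grp_levels(fc_x), by = fac].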

OTHER TIPS

After using the modified version of the grp function with a real dataset, the levels were in the right order but no longer matched the real values after aggregation.

I came up with this, I believe simpler, solution, which passes along the names from the table() result. If I don't use as.numeric(table(...)), I keep the names.

Thanks for the help, matt, Matthew. I'll leave your answer as accepted, as it was helpful.

grp <- function(x) {
  percentage = data.frame(table(x)/length(x))
  list(x = factor(percentage[[1]]),
       percentage = percentage[[2]],
       label = paste0(round(percentage[[2]] * 100), "%")
  )
}
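
The point about keeping the names can be checked in base R. The sketch below uses a small made-up factor (the values and the names share and paired are illustrative assumptions, not data from the thread) to show that as.numeric() drops the level names, while converting the table to a data frame keeps each level next to its share.

```r
x <- factor(c("0-50", "+100", "+100"),
            levels = c("0-50", "51-100", "+100"))

share <- table(x) / length(x)

names(share)               # level names, in levels(x) order
names(as.numeric(share))   # NULL: the link to the levels is lost

# One row per level: column x (factor) and column Freq (numeric)
paired <- as.data.frame(share)
levels(paired$x)           # still in levels(x) order
paired$Freq                # shares in the same order as paired$x
```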
Licensed under: CC-BY-SA with attribution