Question

When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?

Here's an example with fake data.

library(dplyr)

df = data.frame(a=rep(1:3,4), b=rep(1:2,6))

# Now add an extra level to df$b that has no corresponding value in df$a
df$b = factor(df$b, levels=1:3)

# Summarise with plyr, keeping categories with a count of zero
plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)

  b    count_a
1 1    6
2 2    6
3 3    0

# Now try it with dplyr
df %.%
  group_by(b) %.%
  summarise(count_a=length(a), .drop=FALSE)

  b     count_a .drop
1 1     6       FALSE
2 2     6       FALSE

Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?

Was it helpful?

Solution 2

Since dplyr 0.8 group_by gained the .drop argument that does just what you asked for:

df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
df$b = factor(df$b, levels=1:3)

df %>%
  group_by(b, .drop=FALSE) %>%
  summarise(count_a=length(a))

#> # A tibble: 3 x 2
#>   b     count_a
#>   <fct>   <int>
#> 1 1           6
#> 2 2           6
#> 3 3           0

One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:

library(dplyr)
data(iris)

# Add an additional level to Species
iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))

# Species is a factor and empty groups are included in the output
iris %>% group_by(Species, .drop=FALSE) %>% tally

#>   Species         n
#> 1 setosa         50
#> 2 versicolor     50
#> 3 virginica      50
#> 4 empty_level     0

# Add character column
iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))

# Empty groups involving combinations of Species and group2 are not included in output
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#>   Species     group2     n
#> 1 setosa      A         25
#> 2 setosa      B         25
#> 3 versicolor  A         25
#> 4 versicolor  B         25
#> 5 virginica   B         25
#> 6 virginica   C         25
#> 7 empty_level <NA>       0

# Turn group2 into a factor
iris$group2 = factor(iris$group2)

# Now all possible combinations of Species and group2 are included in the output, 
#  whether present in the data or not
iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

#>    Species     group2     n
#>  1 setosa      A         25
#>  2 setosa      B         25
#>  3 setosa      C          0
#>  4 versicolor  A         25
#>  5 versicolor  B         25
#>  6 versicolor  C          0
#>  7 virginica   A          0
#>  8 virginica   B         25
#>  9 virginica   C         25
#> 10 empty_level A          0
#> 11 empty_level B          0
#> 12 empty_level C          0

Created on 2019-03-13 by the reprex package (v0.2.1)

OTHER TIPS

The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:

library(tidyr)
df %>%
  group_by(b) %>%
  summarise(count_a=length(a)) %>%
  complete(b)
# Source: local data frame [3 x 2]
# 
#        b count_a
#   (fctr)   (int)
# 1      1       6
# 2      2       6
# 3      3      NA

If you wanted the replacement value to be zero, you need to specify that with fill:

df %>%
  group_by(b) %>%
  summarise(count_a=length(a)) %>%
  complete(b, fill = list(count_a = 0))
# Source: local data frame [3 x 2]
# 
#        b count_a
#   (fctr)   (dbl)
# 1      1       6
# 2      2       6
# 3      3       0

dplyr solution:

First make grouped df

by_b <- tbl_df(df) %>% group_by(b)

then we summarise those levels that occur by counting with n()

res <- by_b %>% summarise( count_a = n() )

then we merge our results into a data frame that contains all factor levels:

expanded_res <- left_join(expand.grid(b = levels(df$b)),res)

finally, in this case since we are looking at counts the NA values are changed to 0.

final_counts <- expanded_res[is.na(expanded_res)] <- 0

This can also be implemented functionally, see answers: Add rows to grouped data with dplyr?

A hack:

I thought I would post a terrible hack that works in this case for interest's sake. I seriously doubt you should ever actually do this but it shows how group_by() generates the atrributes as if df$b was a character vector not a factor with levels. Also, I don't pretend to understand this properly -- but I am hoping this helps me learn -- this is the only reason I'm posting it!

by_b <- tbl_df(df) %>% group_by(b)

define an "out-of-bounds" value that cannot exist in dataset.

oob_val <- nrow(by_b)+1

modify attributes to "trick" summarise():

attr(by_b, "indices")[[3]] <- rep(NA,oob_val)
attr(by_b, "group_sizes")[3] <- 0
attr(by_b, "labels")[3,] <- 3

do the summary:

res <- by_b %>% summarise(count_a = n())

index and replace all occurences of oob_val

res[res == oob_val] <- 0

which gives the intended:

> res
Source: local data frame [3 x 2]

b count_a
1 1       6
2 2       6
3 3       0

this is not exactly what was asked in the question, but at least for this simple example, you could get the same result using xtabs, for example:

using dplyr:

df %>%
  xtabs(formula = ~ b) %>%
  as.data.frame()

or shorter:

as.data.frame(xtabs( ~ b, df))

result (equal in both cases):

  b Freq
1 1    6
2 2    6
3 3    0
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top