Pregunta

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.

# make data with weird column names that can't be hard coded
data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

# plyr - works
ddply(data, columns, summarize, value=mean(value))

# dplyr - raises error
data %.%
  group_by(columns) %.%
  summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds

What am I missing to translate the plyr example into a dplyr-esque syntax?

Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.

¿Fue útil?

Solución 2

Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:

data = data.frame(
    asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE 
##  27 

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometime catch people by suprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.

Otros consejos

Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:

library(dplyr)

df <-  data.frame(
    asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# Columns you want to group by
grp_cols <- names(df)[-3]

# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)

# Perform frequency counts
df %>%
    group_by_(.dots=dots) %>%
    summarise(n = n())

output:

Source: local data frame [9 x 3]
Groups: asihckhdoydk

  asihckhdoydk a30mvxigxkgh  n
1            A            A 10
2            A            B 10
3            A            C 13
4            B            A 14
5            B            B 10
6            B            C 12
7            C            A  9
8            C            B 12
9            C            C 10

The support for this in dplyr is currently pretty weak, eventually I think the syntax will be something like:

df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))

But that probably won't be there for a while (because I need to think through all the consequences).

In the meantime, you can use regroup(), which takes a list of symbols:

library(dplyr)

df <-  data.frame(
  asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

df %.%
  regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
  summarise(n = n())

If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():

vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)

df %.% regroup(vars2) %.% summarise(n = n())

String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.

The following snippet cleanly solves the problem that @sharoz originally posed (note the need to write out the .dots argument):

# Given data and columns from the OP

data %>%
    group_by_(.dots = columns) %>%
    summarise(Value = mean(value))

(Note that dplyr now uses the %>% operator, and %.% is deprecated).

Update with across() from dplyr 1.0.0

All the answers above are still working, and the solutions with the .dots argument are intruiging.

BUT if you look for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise() and replace the scoped variants like _at or _all. Above all, it replaces very elegantly the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().

So the solution with across looks very readable:

data %>%
  group_by(across(all_of(columns))) %>%
  summarize(Value = mean(value))

Until dplyr has full support for string arguments, perhaps this gist is useful:

https://gist.github.com/skranz/9681509

It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example

cols = c("cyl","gear")
mtcars %.%
  s_group_by(cols) %.%  
  s_summarise("avdisp=mean(disp), max(disp)") %.%
  arrange(avdisp)

It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:

df %.%
    group_by(asdfgfTgdsx, asdfk30v0ja) %.%
    summarise(Value = mean(value))

> df %.%
+   group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+   summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx

  asdfgfTgdsx asdfk30v0ja        Value
1           A           C  0.046538002
2           C           B -0.286359899
3           B           A -0.305159419
4           C           A -0.004741504
5           B           B  0.520126476
6           C           C  0.086805492
7           B           C -0.052613078
8           A           A  0.368410146
9           A           B  0.088462212

where df was your data.

?group_by says:

 ...: variables to group by. All tbls accept variable names, some
      will also accept functons of variables. Duplicated groups
      will be silently dropped.

which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.

@Arun also mentions that you can do:

df %.%
    group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
    summarise(Value = mean(value))

But you can't pass in something that unevaluated is not a name of a variable in the data object.

I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.

data = data.frame(
  my.a = sample(LETTERS[1:3], 100, replace=TRUE),
  my.b = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))

One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:

library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>% 
  # 1. create quantized versions of base variables
  mutate_each(
    funs(Quantized = . > 0)
  ) %>% 
  # 2. group_by the indicator variables
  group_by_(
    .dots = grep("Quantized", names(.), value = TRUE)
    ) %>% 
  # 3. summarize the base variables
  summarize_each(
    funs(sum(., na.rm = TRUE)), contains("X_")
  )

This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.

General example on using the .dots argument as character vector input to the dplyr::group_by function :

iris %>% 
    group_by(.dots ="Species") %>% 
    summarise(meanpetallength = mean(Petal.Length))

Or without a hard coded name for the grouping variable (as asked by the OP):

iris %>% 
    group_by(.dots = names(iris)[5]) %>% 
    summarise_at("Petal.Length", mean)

With the example of the OP:

data %>% 
    group_by(.dots =names(data)[-3]) %>% 
    summarise_at("value", mean)

See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top