Question

EDIT: %.% operator is now deprecated. Use %>% from magrittr.

ORIGINAL QUESTION: What does the %.% operator do? I've seen it used a lot with the dplyr package, but I can't find any supporting documentation on what it is or how it works.

It seems to chain commands together, but that's as far as I can tell. While I'm at it, can anyone explain what the whole gamut of special operators built around the % sign do, and when it is technically the right time to use them to write better code?


Solution

I think Hadley would be the best person to explain to you, but I will give it a shot.

%.% is a binary operator called the chain operator. In R you can define pretty much any binary operator of your own by wrapping its name in the special character %. From what I have seen, we mostly use them to make "chainable" syntax easier (like x + y, which is much nicer than sum(x, y)). You can do really cool stuff with them, see this cool example here.
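As a rough illustration of how such operators are defined, here is a minimal base R sketch. The operator name %paste% is made up for this example and is not part of dplyr or any package; %in% is one of the operators that base R already ships.

```r
# Base R already provides a few %...% operators, for example %in%.
is_member <- 3 %in% c(1, 2, 3)
print(is_member)

# Hypothetical user-defined operator: join two strings with a space.
# The name goes between backticks when the function is defined.
`%paste%` <- function(lhs, rhs) paste(lhs, rhs)

greeting <- "hello" %paste% "world"
print(greeting)

# These operators associate left to right, so they chain naturally:
# ("chain" %paste% "of") %paste% "words"
sentence <- "chain" %paste% "of" %paste% "words"
print(sentence)
```

Because the operator is just an ordinary two-argument function, anything you can write as f(x, y) can be given an infix form this way.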

What is the purpose of %.% in dplyr? To make it easier for you to express yourself, reducing the gap between what you want to do and how you express it.

Taking the example from the introduction to dplyr, let's suppose you want to group flights by year, month and day, select those variables plus the arrival and departure delays, summarise the delays by their mean and then keep only the rows where the mean delay is over 30. If there were no %.%, you would have to write it like this:

filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

It does the job, but it is pretty difficult to write and to read, because the calls are nested inside out. Now, you can write the same thing with a friendlier syntax using the chain operator %.%:

hflights %.%
  group_by(Year, Month, DayofMonth) %.%
  select(Year:DayofMonth, ArrDelay, DepDelay) %.%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %.%
  filter(arr > 30 | dep > 30)

It is easier both to write and read!

And how does that work?

Let's take a look at the definitions. First for %.%:

function (x, y) 
{
    chain_q(list(substitute(x), substitute(y)), env = parent.frame())
}

It uses another function called chain_q. So let's look at it:

function (calls, env = parent.frame()) 
{
    if (length(calls) == 0) 
        return()
    if (length(calls) == 1) 
        return(eval(calls[[1]], env))
    e <- new.env(parent = env)
    e$`__prev` <- eval(calls[[1]], env)
    for (call in calls[-1]) {
        new_call <- as.call(c(call[[1]], quote(`__prev`), as.list(call[-1])))
        e$`__prev` <- eval(new_call, e)
    }
    e$`__prev`
}

What does that do?

To simplify things, let's assume you called: group_by(hflights, Year, Month, DayofMonth) %.% select(Year:DayofMonth, ArrDelay, DepDelay).

Your calls x and y are then group_by(hflights, Year, Month, DayofMonth) and select(Year:DayofMonth, ArrDelay, DepDelay), respectively. The function creates a new environment called e (e <- new.env(parent = env)) and stores in it an object called __prev holding the result of evaluating the first call (e$`__prev` <- eval(calls[[1]], env)). Then, for each remaining call, it builds a new call whose first argument is the result of the previous call, that is, __prev. In our case that would be select(`__prev`, Year:DayofMonth, ArrDelay, DepDelay). Evaluating these calls one after another inside the loop is what "chains" them.
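To see the mechanism in isolation, here is a small base R sketch of the same idea. It is not dplyr's code: the operator name %then% is invented for this example, and it only handles the two-call case described above, using plain vectors instead of hflights.

```r
# Hypothetical operator mimicking the chaining trick with base R only.
`%then%` <- function(x, y) {
  rhs <- substitute(y)                    # capture the right-hand call unevaluated
  e <- new.env(parent = parent.frame())   # scratch environment, like `e` in chain_q
  e$`__prev` <- x                         # evaluate the left-hand side and store it
  # Rebuild the right-hand call with `__prev` inserted as its first argument,
  # e.g. head(2) becomes head(`__prev`, 2).
  new_call <- as.call(c(list(rhs[[1]], quote(`__prev`)), as.list(rhs)[-1]))
  eval(new_call, e)
}

# sqrt(c(1, 4, 9)) is 1 2 3, and sum() of that is 6.
total <- c(1, 4, 9) %then% sqrt() %then% sum()
print(total)

# Extra arguments on the right-hand side are kept after `__prev`.
top_two <- c(3, 1, 2) %then% sort(decreasing = TRUE) %then% head(2)
print(top_two)
```

Each %then% evaluates everything to its left first, so a long chain is just this two-call step repeated from left to right.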

Since binary operators can be applied one after another, you can use this syntax to express very complex manipulations in a very readable way.

OTHER TIPS

A quick search landed me here:

dplyr provides another innovation over plyr: the ability to chain operations together from left to right with the %.% operator. This makes dplyr behave a little like a grammar of data manipulation.

Example:

Batting %.%
  group_by(playerID) %.%
  summarise(total = sum(G)) %.%
  arrange(desc(total)) %.%
  head(5)

Read more about it from the help section, ?"%.%".

Licensed under: CC-BY-SA with attribution