Question

How can I group a density plot and have the density of each group sum to one, when using weighted data?

The ggplot2 help for geom_density() suggests a hack for using weighted data: dividing by the sum of the weights. But when grouped, this means that the combined density of the groups totals one. I would like the density of each group to total one.

I have found two clumsy ways to do this. The first is to treat each group as a separate dataset:

library(ggplot2)
library(ggplot2movies) # load the movies dataset

m <- ggplot()
m + geom_density(data = movies[movies$Action == 0, ], aes(rating, weight = votes/sum(votes)), fill=NA, colour="black") +
    geom_density(data = movies[movies$Action == 1, ], aes(rating, weight = votes/sum(votes)), fill=NA, colour="blue")

Obvious disadvantages are the manual handling of factor levels and aesthetics. I also tried using the windowing functionality of the data.table package to create a new column for the total votes per Action group, dividing by that instead:

library(data.table)

movies.dt <- data.table(movies)
setkey(movies.dt, Action)
# Add a column holding the total votes within each Action group
movies.dt[, votes.per.group := sum(votes), by = Action]
m <- ggplot(movies.dt, aes(x=rating, weight=votes/votes.per.group, group = Action, colour = Action))
m + geom_density(fill=NA)

Are there neater ways to do this? Because of the size of my tables, I'd rather not replicate rows by their weighting for the sake of using frequency.


Solution

Using dplyr, compute each group's total votes first and divide by that inside aes():

library(dplyr)
library(ggplot2)
library(ggplot2movies)

movies %>% 
  group_by(Action) %>% 
  mutate(votes.grp = sum(votes)) %>% 
  ggplot(aes(x=rating, weight=votes/votes.grp, group = Action, colour = Action)) +
  geom_density()

[Figure: density of rating by Action group, as output by the code above]
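One way to convince yourself that per-group weights give each group a density that totals one is to check it outside ggplot2 with stats::density(), which accepts a weights argument. The sketch below is only an illustration: the toy data frame is made up to stand in for movies, and the fixed bandwidth bw = 0.5 is an arbitrary assumption rather than anything ggplot2 would pick.

```r
# Toy data standing in for the movies table (values are made up).
toy <- data.frame(
  rating = c(5.1, 6.3, 7.8, 4.2, 8.0, 6.9),
  votes  = c(10, 30, 60, 50, 25, 25),
  Action = c(0, 0, 0, 1, 1, 1)
)

# Per-group weights: each row's votes over its own group's total.
toy$w <- toy$votes / ave(toy$votes, toy$Action, FUN = sum)

# Estimate one weighted density per group and integrate it numerically.
# bw = 0.5 is an arbitrary fixed bandwidth chosen for this sketch.
areas <- sapply(split(toy, toy$Action), function(g) {
  d <- density(g$rating, weights = g$w, bw = 0.5)
  sum(d$y) * (d$x[2] - d$x[1])  # Riemann-sum approximation of the area
})

print(areas)
```

Each entry of areas should come out close to one, up to the error of the numerical integration.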

OTHER TIPS

I think an auxiliary table might be your only option. I had a similar problem here. The issue, it seems, is that when ggplot uses aggregating functions in aes(...), it applies them to the whole dataset, not to the subsetted data. So when you write

aes(weight=votes/sum(votes))

the votes in the numerator is subsetted based on Action, but votes in the denominator, sum(votes), is not. The same is true for the implicit grouping with facets.
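The difference between the two denominators can be seen without plotting anything. This is a minimal base R sketch on a made-up toy data frame (not the movies data): it compares weights normalised by the overall total with weights normalised by each group's total, using ave() to get the per-group sums.

```r
# Toy data standing in for the movies table (values are made up).
toy <- data.frame(
  rating = c(5.1, 6.3, 7.8, 4.2, 8.0, 6.9),
  votes  = c(10, 30, 60, 50, 25, 25),
  Action = c(0, 0, 0, 1, 1, 1)
)

# Global normalisation: the denominator is the total over all rows.
w_global <- toy$votes / sum(toy$votes)
global_sums <- tapply(w_global, toy$Action, sum)

# Per-group normalisation: ave() returns, for every row, its group's total.
w_group <- toy$votes / ave(toy$votes, toy$Action, FUN = sum)
group_sums <- tapply(w_group, toy$Action, sum)

print(global_sums)  # the groups share a total of one between them
print(group_sums)   # each group totals one on its own
```

With the global denominator the group totals only add up to one together, which is the behaviour described above; with the per-group denominator each group's weights sum to one.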

If someone else has a way around this I'd love to hear it.

Licensed under: CC-BY-SA with attribution