Question

I have produced some density plots using ggplot2 and stat_density. My colleague mentioned he wasn't convinced that the area under each curve would sum to 1. So, I set out to calculate the area under the curve, and I am wondering if there might be a better approach than what I did.

Here is an example of what I did:

data(iris)

p<-ggplot(iris,aes(x=Petal.Length))+
          stat_density(aes(colour=Species),geom="line",position="identity")

q<-print(p)
q<-q$data[[1]]

# calculate interval between density estimates for a given point.
# assume it is the same interval for all estimates
interval<-q$x[2]-q$x[1] 

# calculate AUC by summing interval*height for the density estimate at each point
tapply(q$density*interval,  
       q$group,
       sum)

The result:

    1         2         3 
0.9913514 1.0009785 0.9817040 

It seems to works decently, but I wonder if there is a better way of doing this. In particular, my calculation of the interval (i.e. dx, I suppose) seems like it could be a problem, especially if the different density curves use different intervals.

Was it helpful?

Solution

Your way is already good.

Another way to do it is using the trapezoid rule:

data <- cbind(q$x, q$y)
by(data, q$group, FUN = function(x) trapz(x[, 1], x[, 2]))

The results are nearly the same:

INDICES: 1
[1] 0.9903457

INDICES: 2
[1] 1.000978

INDICES: 3
[1] 0.9811152

This is because at the bandwidth needed to make the graph of the densities look reasonable (interval in your code), you are very close to what you would get if you could do the actual integral.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top