Question

I have a list of ages in days and I am looking to display them in years on a density plot.

I did this two ways - changing the labels on the x axis to years and by dividing the data by 365. These methods give me different density estimates:

df <- data.frame(id = 1:80000, age = rnorm(80000, 46, 5) * 365)

The first plot is generated using:

breaks <- seq(from = min(df$age), to = max(df$age), by = 10*365)
ggplot(data = df, aes(x = age)) + 
    geom_density(aes(y = ..density..)) + 
    scale_x_continuous(breaks= breaks, labels = floor(breaks/365))

The density displayed on the y-axis ranges from 0 to 0.0002.

When I do this however (divide the ages by 365 to get years - not just change the x labels like above):

ggplot(data = df, aes(x = age/365)) + 
    geom_density(aes(y = ..density..))

The plot looks the same, but the density ranges from 0 to 0.08. I am struggling to understand what is going on: why is the density different between the two plots?


Solution

The density is different in the two plots because in one case you have 365 times as many units horizontally, so the vertical units need to be 1/365th those of the other plot: a probability density function must integrate to one, i.e. the area under each of these curves must equal one.
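As a rough check of the area argument, here is a minimal sketch in base R. It assumes simulated data like the question's; the seed and the helper name `area` are mine, not from the original post. It integrates both density estimates with the trapezoid rule and compares their peaks.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
age_days <- rnorm(80000, 46, 5) * 365

# Hypothetical helper: trapezoid-rule area under a density() result
area <- function(d) sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

d_days  <- density(age_days)        # x measured in days
d_years <- density(age_days / 365)  # x measured in years

area_days  <- area(d_days)   # should be close to 1
area_years <- area(d_years)  # should also be close to 1

# The curves differ only by a rescaling: peaks differ by a factor near 365
peak_ratio <- max(d_years$y) / max(d_days$y)
```

Both areas come out near one, so the curve over the wider (days) axis has to be proportionally lower.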

This is easier to think about in terms of bins rather than density curves. If you have one bin replacing 365 bins, the probability of landing in the one bin is much higher than the average probability of landing in the individual bins.
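The bin picture can be sketched with the theoretical normal distribution from the question (mean 46 years, sd 5 years) instead of the sample. The variable names are illustrative only. The code compares the probability of a one-day-wide bin with that of a one-year-wide bin, both starting at the mean.

```r
mu_days <- 46 * 365  # mean age in days
sd_days <- 5 * 365   # standard deviation in days

# Probability of landing in a one-day-wide bin starting at the mean
p_day <- pnorm(mu_days + 1, mu_days, sd_days) - pnorm(mu_days, mu_days, sd_days)

# Probability of landing in a one-year-wide bin starting at the mean
p_year <- pnorm(mu_days + 365, mu_days, sd_days) - pnorm(mu_days, mu_days, sd_days)

# Density is probability per unit of width
dens_per_day  <- p_day / 1   # bin width: 1 day
dens_per_year <- p_year / 1  # bin width: 1 year
```

The per-day value is on the order of 0.0002 and the per-year value on the order of 0.08, which matches the two y-axis ranges in the question.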

For the specific sample data you provide, we can see the conversion between the vertical units by looking at the peaks of both functions:

> max(density(df$age)$y) # max of density in days, more horizontal units
[1] 0.0002178977
> df$ageinyears <- df$age/365 # create an age-in-years variable
> max(density(df$ageinyears)$y) # max density in years, fewer horizontals
[1] 0.07953267
> max(density(df$age)$y)*365 
[1] 0.07953267

The practical reason this is an issue in plotting (and possibly the main thrust of your question) is that the function estimating the density for ggplot inherits the x aesthetic from the parent aes(), so it knows nothing about the custom x-axis labels you are using. Rather than just relabelling the x-axis as in your first plot, you could explicitly tell geom_density not to use the inherited x values:

ggplot(data = df, aes(x = age)) + 
    geom_density(aes(x = age/365, y = ..density..))
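Another option, sketched here under the assumption that ggplot2 3.3.0 or later is installed (for `after_stat()`), is to keep the data in days with year labels on the x-axis and rescale the computed density by 365 so that the y-axis reads "per year". The seed and axis titles are my own additions.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(id = 1:80000, age = rnorm(80000, 46, 5) * 365)
breaks <- seq(from = min(df$age), to = max(df$age), by = 10 * 365)

p <- ggplot(data = df, aes(x = age)) +
    # density is estimated per day; multiply by 365 to express it per year
    geom_density(aes(y = after_stat(density) * 365)) +
    scale_x_continuous(breaks = breaks, labels = floor(breaks / 365)) +
    labs(x = "age (years)", y = "density (per year)")

print(p)
```

With this, the y-axis should peak near 0.08, in line with the plot built from age/365.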

OTHER TIPS

The best advice is to just ignore the tick labels on the y-axis. They don't help at all with interpreting the density plot and, as you have seen, are more likely to confuse than to help.

My preference would be for the default behavior of density plots, histograms, and any similar plots to not label the y-axis tick marks, since they generally don't mean anything, only tend to distract from the important parts of the graph, and often cause confusion. Even when they are scaled to values intended to be meaningful, they are not helpful for the main purpose of the plot and can still cause confusion (I changed the number of bins in my histogram and now my y-tick labels are very different, panic! panic!). Unfortunately, there is so much inertia in plotting them that I alone am unlikely to get this changed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow