Question

Apologies if this is a bit of a simple question, but I haven't been able to find any answer to this over the past week and it's driving me crazy.

Background Info: I have a dataset that tracks the weight of 5 individuals over 5 years. Each year, I have a distribution for the weight of individuals in the group, from which I calculate the mean and standard deviation. Data is as follows:

 Year = [2002,2003,2004,2005,2006]
 Weights_2002 = [12, 14, 16, 18, 20]
 Weights_2003 = [14, 16, 18, 20,20]
 Weights_2004 = [16, 18, 20, 22, 18]
 Weights_2005 = [18, 21, 22, 22, 20]
 Weights_2006 = [2, 21, 19, 20, 20]

The Question: How do I project annual distributions of weight for the group the next 10 years? Ideally, I would like the uncertainty about the mean to increase as time goes on. Likewise, I would like the uncertainty about the standard deviation to increase too. Phrased another way, I would like to project the distributions of weight going forward, accounting for both:

  1. Natural Variance in the Data
  2. Increasing uncertainty.

Any help would be greatly, greatly appreciated. If anyone can suggest how to do this in R, that would be even better.

Thanks guys!

Was it helpful?

Solution

Absent specific suggestions on how to use the forecasting tools in R, viz. the comments to your question, here is an alternative approach that uses Monte Carlo simulation.

First, some housekeeping: the value 2 in Weights_2006 is either a typo or an outlier. Since I can't tell which, I will assume it's an outlier and exclude it from the analysis.

Second, you say you want to project the distributions based on increasing uncertainty. But your data doesn't support that.

Year <- c(2002,2003,2004,2005,2006)
W2   <- c(12, 14, 16, 18, 20)
W3   <- c(14, 16, 18, 20,20)
W4   <- c(16, 18, 20, 22, 18)
W5   <- c(18, 21, 22, 22, 20)
W6   <- c(NA, 21, 19, 20, 20)
df   <- rbind(W2,W3,W4,W5,W6)
df   <- data.frame(Year,df)

library(reshape2)   # for melt(...)
library(ggplot2)
data <- melt(df,id="Year", variable.name="Individual",value.name="Weight")
ggplot(data)+
  geom_histogram(aes(x=Weight),binwidth=1,fill="lightgreen",colour="grey50")+
  facet_grid(Year~.)

The mean weight goes up over time, but the variance decreases. A look at the individual time series shows why.

ggplot(data, aes(x=Year, y=Weight, color=Individual))+geom_line()

In general, an individual's weight increases linearly with time (about 2 units per year), until it reaches 20, when it stops increasing but fluctuates. Since your initial distribution was uniform, the individuals with lower weight saw an increase over time, driving the mean up. But the weight of heavier individuals stopped growing. So the distribution gets "bunched up" around 20, resulting in a decreasing variance. We can see this in the numbers: increasing mean, decreasing standard deviation.

smry <- function(x)c(mean=mean(x),sd=sd(x))
aggregate(Weight~Year,data,smry)
#   Year Weight.mean  Weight.sd
# 1 2002  16.0000000  3.1622777
# 2 2003  17.6000000  2.6076810
# 3 2004  18.8000000  2.2803509
# 4 2005  20.6000000  1.6733201
# 5 2006  20.0000000  0.8164966

We can model this behavior using a Monte Carlo simulation.

set.seed(1)
start <- runif(1000,12,20)
X <- start
result <- X
for (i in 2003:2008){
  X <- X + 2
  X <- ifelse(X<20,X,20) +rnorm(length(X))
  result <- rbind(result,X)
}
result <- data.frame(Year=2002:2008,result)

In this model, we start with 1000 individuals whose weight forms a uniform distribution between 12 and 20, as in your data. At each time step we increase the weights by 2 units. If the result is >20 we clip it to 20. Then we add random noise distributed as N[0,1]. Now we can plot the distributions.

model <- melt(result,id="Year",variable.name="Individual",value.name="Weight")
ggplot(model,aes(x=Weight))+
  geom_histogram(aes(y=..density..),fill="lightgreen",colour="grey50",bins=20)+
  stat_density(geom="line",colour="blue")+
  geom_vline(data=aggregate(Weight~Year,model,mean), aes(xintercept=Weight), colour="red", size=2, linetype=2)+
  facet_grid(Year~.,scales="free")

The red bars show the mean weight in each year.

If you believe that the natural variation in the weight of an individual increases over time, then use N[0,sigma] as the error term in the model, with sigma increasing with Year. The problem is that there is nothing in your data to support that.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top