Question

I need to do Probability Density Prediction of the following data in R:

year = c(1971, 1984, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 
2007, 2008, 2009, 2010, 2011, 2012, 2013)
incidents = c(1, 1, 1, 1, 3, 1, 6, 6, 9, 11, 21, 37, 38, 275, 226, 774, 1064)

The data are in a data.frame in R like:

dat <- data.frame(year,incidents)

The idea is to fit a model based on the earlier years and "predict" the number of incidents for the last year of the available data.

I'm new to R, so any suggestions, advice and so forth are welcome. Thanks in advance.


Solution

Broadly, the two main approaches to modeling are the so-called "mechanistic" and "empirical" approaches. Both have their adherents (and their detractors). The mechanistic approach asserts that modeling should proceed from an understanding of the underlying phenomena (mechanism), which is then translated to some type of mathematical equation(s), which are then fit to the data (to test the mechanism). The empirical approach assembles a (usually long) list of models (equations) and seeks to find the one that "fits best". Empirical modeling is appealing but dangerous because assessing when you have a "good fit" is not trivial - although it is often treated that way.

You have not given us nearly enough information to formulate a mechanistic model, so here's an illustration of a couple of empirical models, as a cautionary tale:

Finite-time singularity models are popular with your type of data. Among other things, these models are used to "predict" stock market bubbles (the LPPL model). The basic idea is that a catastrophe (singularity) is coming, and we want to predict when. So we use a function of the form:

y = a * (c - x)^b

With b < 0, y approaches a singularity as x -> c.

In R code, we can fit a model like this as follows:

# Finite-time singularity model: y = a * (c - x)^b, fitted with the
# Levenberg-Marquardt least-squares routine from the minpack.lm package
library(minpack.lm)

# model function
f <- function(par, x) {
  a <- par[1]
  b <- par[2]
  c <- par[3]
  a * (c - x)^b
}

# residual function passed to nls.lm()
resid <- function(par, obs, xx) obs - f(par, xx)

# starting values; c must lie beyond the last observed year
start <- c(a = 1, b = -1, c = 2100)
nls.out <- nls.lm(par = start, fn = resid, obs = dat$incidents, xx = dat$year,
                  control = nls.lm.control(maxiter = 500))
coef(nls.out)

# plot the data and overlay the fitted curve
with(dat, plot(incidents ~ year, main = "Finite-Time Singularity Model"))
lines(dat$year, f(coef(nls.out), dat$year), col = 2, lwd = 2)

This gives what appears to be a "pretty good fit".

In fact, the model overstates incidents early on, and tends to understate them later (which is terrible because we want a prediction for the future). The residuals plot shows this clearly.

with(dat,plot(year,resid(coef(nls.out),incidents,year),
              main="Residuals Plot", ylab="residuals"))

Another approach notes that your data are counts (the number of incidents per year), which suggests a generalized linear model in the Poisson family:

# Generalized linear model, Poisson family
fit.glm <- glm(incidents ~ year, data = dat, family = poisson)

# plot the data and overlay the fitted trend
with(dat, plot(incidents ~ year))
lines(dat$year, predict(fit.glm, type = "response"), col = 2, lwd = 2)

# standard diagnostic plots
par(mfrow = c(2, 2))
plot(fit.glm)

This fit is better, but still not very good, as the diagnostic plots show. The residuals follow a trend, they are not normally distributed, and some of the data points have unacceptably high leverage.
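
Since the stated goal is to predict the last year from the earlier ones, here is a minimal sketch of that exercise with the Poisson model (the train and fit.holdout names are only illustrative): refit the same glm() call with the final row held out and compare the prediction with the observed 2013 count.

# Hold out the last year, refit the Poisson GLM, and predict the held-out year
train <- dat[-nrow(dat), ]                     # all years except 2013
fit.holdout <- glm(incidents ~ year, data = train, family = poisson)
predict(fit.holdout, newdata = dat[nrow(dat), ], type = "response")  # predicted 2013
dat$incidents[nrow(dat)]                                             # observed 2013

The gap between those two numbers is a more honest measure of predictive quality than any in-sample fit.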

OTHER TIPS

dat <- data.frame(year,incidents)
with(dat, plot(incidents~year))

[Plot of incidents vs. year]

So something changed ... but what is causing the abrupt increase in the number of incidents? Only you, the scientist, have the key. You can probably predict that there will be some increase over the next year or two, but whether that increase follows an exponential or a logistic pattern is determined by the underlying domain of study. A logistic model would not be terribly accurate if you are in what is usually called the "log phase" of growth, because the upper limit of incidents per year is not known.
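
As a rough sketch of those two options: the Poisson GLM above already implies an exponential trend through its log link (E[incidents] = exp(a + b*year)), while a logistic curve can be attempted with nls() and the self-starting SSlogis() model; the latter may well fail to converge on data with no visible plateau, which is rather the point.

# The log link of the Poisson GLM implies exponential growth in year
exp(coef(fit.glm)["year"])   # estimated multiplicative growth factor per year

# A logistic alternative; wrapped in try() because convergence is doubtful
# when the data show no upper plateau
fit.logis <- try(nls(incidents ~ SSlogis(year, Asym, xmid, scal), data = dat))
if (!inherits(fit.logis, "try-error")) {
  lines(dat$year, predict(fit.logis), col = 4, lwd = 2)
}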

Licensed under: CC-BY-SA with attribution