Question

I am trying to create a "long" data frame of indicator ("dummy") variables out of a very peculiar type of "wide" data frame in R that has interval ranges of years defining my data.

What I have looks like this:

f=data.frame(name=c("A","B","C"),
             year.start=c(1990,1994,1993),year.end=c(1994,1995,1993))

  name year.start year.end
1    A       1990     1994
2    B       1994     1995
3    C       1993     1993

Update: I have changed the value of year.start for A to 1990 from the initial example of 1993 to address some of the answers below which rely on unique values instead of intervals.

What I would like is a long data frame that would look like this, with an entry for each of the possible years in the original data frame, eg, 1990 through 1995 where 1 = present and 0 = absent.

 name year indicator
A     1990   1
A     1991   1
A     1992   1
A     1993   1
A     1994   1
A     1995   0
B     1990   0
B     1991   0
B     1992   0
B     1993   0
B     1994   1
B     1995   1
C     1990   0
C     1991   0
C     1992   0
C     1993   1
C     1994   0
C     1995   0

Try as I might, I don't see how I can do this with Hadley Wickham's reshape2 package.

Thanks!

Was it helpful?

Solution 3

Here's another solution, similar to the ones above, which aims to be straightforward:

zz           <- cbind(name=f[1],year=rep(min(f[-1]):max(f[-1]),each=nrow(f)))
zz$indicator <- as.numeric((f$name==zz$name & 
                            f$year.start<=zz$year & 
                            f$year.end  >=zz$year))
result       <- zz[order(zz$name,zz$year),]

The first line builds a template with all the names and all the years. The second line sets indicator based on whether it is present in the range. The third line just reorders the result.

OTHER TIPS

Someone else might have suggestion for reshape2, but here is a base R solution:

years <- factor(unlist(f[-1]), levels=seq(min(f[-1]), max(f[-1]), by=1))
result <- data.frame(table(years, rep(f[[1]], length.out=length(years))))
#    years Var2 Freq
# 1   1990    A    1
# 2   1991    A    0
# 3   1992    A    0
# 4   1993    A    0
# 5   1994    A    1
# 6   1995    A    0
# 7   1990    B    0
# 8   1991    B    0
# 9   1992    B    0
# 10  1993    B    0
# 11  1994    B    1
# 12  1995    B    1
# 13  1990    C    0
# 14  1991    C    0
# 15  1992    C    0
# 16  1993    C    2
# 17  1994    C    0
# 18  1995    C    0

here is a step-by-step breakdown, using data.table

library(data.table)
f <- as.data.table(f)

## ALL OF NAME-YEAR COMBINATIONS
ALL <- f[, CJ(name=name, year=seq(min(year.start), max(year.end)))]

## WHICH COMBINATIONS EXIST
PRESENT <- f[, list(year = seq(year.start, year.end)), by=name]

## SETKEYS FOR MERGING
setkey(ALL, name, year)
setkey(PRESENT, name, year)

## INITIALIZE INDICATOR TO ZERO, THEN SET TO 1 FOR THOSE PRESENT
ALL[, indicator := 0]
ALL[PRESENT, indicator := 1]

ALL

   name year indicator
1:    A 1993         1
2:    A 1994         1
3:    A 1995         0
4:    B 1993         0
5:    B 1994         1
6:    B 1995         1
7:    C 1993         1
8:    C 1994         0
9:    C 1995         0

Another base R solution

f=data.frame(name=c("A","B","C"),
         year.start=c(1993,1994,1993),year.end=c(1994,1995,1993), stringsAsFactors=F)
x <- expand.grid(unique(f$name),min(f1$year):max(f1$year)) 
names(x) <- c("name", "year")
x$indicator <- sapply(1:nrow(x), function(i) sum(x$name[i]==f$name & x$year[i] >=    f$year.start & x$year[i] <= f$year.end))
x[order(x$name),]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top