Question

I have some (financial time-series) data in a tall format:

require(data.table)
DT <- data.table(Variable=c(rep("a",times = 3), rep("b", times=3)), 
                 Date=as.Date(c("2014-04-01","2014-04-02","2014-04-03"
                                ,"2014-04-02", "2014-04-03","2014-04-04")),
                 Value=c(1:3,3:1), key=c("Variable","Date"))

DT

   Variable       Date Value
1:        a 2014-04-01     1
2:        a 2014-04-02     2
3:        a 2014-04-03     3
4:        b 2014-04-02     3
5:        b 2014-04-03     2
6:        b 2014-04-04     1

I would like to calculate a third variable spread, where spread = a - b for each common Date row (basically a spread between two time-series - a common transformation in the financial domain).

Desired output:

   Variable       Date Value
1:   spread 2014-04-02    -1
2:   spread 2014-04-03     1

I know of solution when I dcast.data.table the data into the wide format (i.e. into table with columns c("Date", "a", "b")), but due to performance issues on large data, is there an elegant way how to do this directly in the tall format using a)dplyr and b)data.table (two-part question)?

Ideally on the dplyr side, I am looking for something as expressive as mutate(tbl_dt(DT, tall=TRUE), spread=a-b). (Disclaimer: I am a complete newbie in dplyr)

Real-life dataset:

# download 200 stocks from Quandl.com. requires free registration
library(Quandl); library(data.table); library(plyr)
ntickers <- 200 ; auth.token="register_free_to_obtain_token"
code.file <- tempfile()
download.file("https://s3.amazonaws.com/quandl-static-content/quandl-stock-code-list.csv",
              destfile=code.file)
tickers <- na.omit(read.csv2(code.file, sep=",", stringsAsFactors=FALSE)[,"Price.Code"])
lst <- na.omit(tickers)[1:ntickers]
names(lst) <- lst
Q <- ldply(lst, Quandl, 
           type = "raw", end_date="2014-04-08", 
           sort="asc", auth=auth.token) # might take minutes
DT <- as.data.table(Q)[,Date:=as.IDate(Date)]
setnames(DT, ".id", "Instrument")
setkey(DT, Instrument, Date)

> dim(DT); object.size(DT)
[1] 685512      8
41145752 bytes
> DT
              Instrument       Date Open High  Low Close Volume Adjusted Close
     1:    GOOG/AMEX_ABI 1981-03-11   NA   NA 6.56  6.75 217200             NA
     2:    GOOG/AMEX_ABI 1981-03-12   NA   NA 6.66  6.88 616400             NA
     3:    GOOG/AMEX_ABI 1981-03-13   NA   NA 6.81  6.84 462000             NA
     4:    GOOG/AMEX_ABI 1981-03-16   NA   NA 6.81  7.00 306400             NA
     5:    GOOG/AMEX_ABI 1981-03-17   NA   NA 6.88  6.88 925600             NA
    ---                                                                       
685508: YAHOO/TSX_AHX_TO 2014-04-02 0.75 0.75 0.75  0.75   5000           0.75
685509: YAHOO/TSX_AHX_TO 2014-04-03 0.79 0.82 0.75  0.82  25700           0.82
685510: YAHOO/TSX_AHX_TO 2014-04-04 0.81 0.81 0.78  0.80   4500           0.80
685511: YAHOO/TSX_AHX_TO 2014-04-07 0.80 1.05 0.80  0.96  40400           0.96
685512: YAHOO/TSX_AHX_TO 2014-04-08 0.95 0.96 0.90  0.95  21300           0.95
Was it helpful?

Solution

You ought to be able to work with this:

> merge(DT["a", ], DT["b",], by="Date")
         Date Variable.x Value.x Variable.y Value.y
1: 2014-04-02          a       2          b       3
2: 2014-04-03          a       3          b       2

The help page for merge.data.table suggests you read FAQ 1.12 for a detailed comparison of this with X[Y,...] approaches.

OTHER TIPS

Here's one approach with dplyr. First we create the data:

require(dplyr)

df <- data.frame(
  Variable = rep(c("a", "b"), each = 3), 
  Date = rep(as.Date("2014-04-01") + 0:2, 2),
  Value = c(1:3, 3:1)
)

Instead of rotating into a wide form, we could instead use a vectorised comparison:

df %.% 
  group_by(Date) %.%
  summarise(spread = Value[Variable == "a"] - Value[Variable == "b"])

## Source: local data frame [3 x 2]
## 
##         Date spread
## 1 2014-04-01     -2
## 2 2014-04-02      0
## 3 2014-04-03      2

This will correctly fail if there are multiple values of a or b, because summarise() requires that results are of length one. The same approach would work with data.table, but you'd need to be a little more careful about checking the results (because data table is less strict/more flexible here compared to dplyr).

You could also use the join approach suggested by BondedDust. It's not as quite as convenient with dplyr as it is with data.table:

a <- df %.% filter(Variable == "a") %.% select(-Variable)
b <- df %.% filter(Variable == "b") %.% select(-Variable)

inner_join(a, b, by = "Date") %.%
  mutate(spread = Value.x - Value.y)

##         Date Value.x Value.y spread
## 1 2014-04-01       1       3     -2
## 2 2014-04-02       2       2      0
## 3 2014-04-03       3       1      2

Here's a data.table method by using dcast.data.table.

I hope I provide an useful start, and there're some following issues about selecting NAs, and speed gain.

# Create Dataset
require(data.table)
require(reshape2)
DT <- data.table(Variable=c(rep("a",times = 3), rep("b", times=3)), 
             Date=as.Date(c("2014-04-01","2014-04-02","2014-04-03"
                            ,"2014-04-02", "2014-04-03","2014-04-04")),
             Value=c(1:3,3:1), key=c("Variable","Date"))

# using data.table
DT2 <- dcast.data.table(DT, Date ~ Variable, drop=FALSE) 
DT2[, spread:= a-b, by = Date][!is.na(spread),]
# Actually I'm not clear about the different between `drop= FALSE` and `drop = TRUE` 

This is the output

         Date a b spread
1: 2014-04-02 2 3     -1
2: 2014-04-03 3 2      1
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top