Get difference to previous date in ffdf object

https://stackoverflow.com/questions/13397473

flat-file
r

29-11-2021

Question

Some time ago I asked the following question:

I have a list of deals with trading day and market value. Every trading day new positions are added to the list, but old ones never disappear (when a position expires, its value simply stays constant). The list looks as follows:
Deal  Trade_Date MktValue Desidered_Col
Deal1 31.08.2012 10       +10
Deal2 31.08.2012 21       +21
Deal1 03.09.2012 12       +2
Deal2 03.09.2012 19       -2
Deal3 03.09.2012 2        +2
For each deal, I would like to get the difference to the previous trade date (Desidered_Col in the above example).

And the following solution was provided to me by Roland:

df <- read.table(text="Deal Trade_Date MktValue Desidered_Col
Deal1 31.08.2012 10 +10
Deal2 31.08.2012 21 +21
Deal1 03.09.2012 12 +2
Deal2 03.09.2012 19 -2
Deal3 03.09.2012 2 +2", header=TRUE)

library(data.table)
dt <- as.data.table(df)

diff.padded <- function(x) c(x[1], diff(x))
dt[, Desidered_Col2 := diff.padded(MktValue), by=Deal]
    Deal Trade_Date MktValue Desidered_Col Desidered_Col2
1: Deal1 31.08.2012       10            10             10
2: Deal2 31.08.2012       21            21             21
3: Deal1 03.09.2012       12             2              2
4: Deal2 03.09.2012       19            -2             -2
5: Deal3 03.09.2012        2             2              2
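
For readers without data.table, the same per-deal difference can be sketched in base R with ave(). This is only an illustration on the example data above, not part of Roland's answer; the column name delta is an assumption introduced here, and the rows are assumed to be already ordered by date within each deal.

```r
# Example data from the question (Desidered_Col omitted)
df <- data.frame(
  Deal       = c("Deal1", "Deal2", "Deal1", "Deal2", "Deal3"),
  Trade_Date = as.Date(c("31.08.2012", "31.08.2012", "03.09.2012",
                         "03.09.2012", "03.09.2012"), format = "%d.%m.%Y"),
  MktValue   = c(10, 21, 12, 19, 2)
)

# First element is kept as-is, the rest are differences to the previous element
diff.padded <- function(x) c(x[1], diff(x))

# ave() applies the function within each Deal and puts the results back
# in the original row order
df$delta <- ave(df$MktValue, df$Deal, FUN = diff.padded)
print(df)
```

Like the data.table version, this works entirely in RAM, so it does not by itself solve the ffdf case.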

The solution works perfectly with data.table. However, given the size of my table, I decided to try working with an ffdf object. My data is now in an ffdf file, and I am trying to reproduce the same solution, unfortunately without success. Do you have any advice on how I can reproduce this with an ffdf? Thanks for your help.

Here is the full code I am running:

# Load needed packages
library(RODBC)
library(data.table)
library(ETLUtils)
library(RSQLite)
library(ffbase)

calendar <- read.csv("Trading_Calendar.csv",sep=";",stringsAsFactors=FALSE)
calendar$STICHTAG <- as.Date(calendar$STICHTAG,"%d.%m.%Y")

ST_a=Sys.Date()-2
rd_a=as.Date("13.11.2012","%d.%m.%Y")
ST=paste("'",as.character(format(ST_a,"%d.%m.%Y")),"'",sep="")
rd=paste("'",as.character(format(rd_a,"%d.%m.%Y")),"'",sep="")

gc(TRUE)

st.strom <- calendar[calendar$STICHTAG>=rd_a & calendar$STICHTAG<=ST_a &   calendar$BR_Strom==1,"STICHTAG"]
st.strom <- format(st.strom,"%d.%m.%Y")
st.strom.s <- paste("('",do.call(paste, c(as.list(as.character(st.strom)), sep="','")),"')",sep="")


started.at=proc.time()
Sys.sleep(1)

memory.limit(size=4095)


query <- paste("select * from is_bewertung_data where commodity in ('CASH','COAL','CO2','ELEC','GCERT') 
               and stichtag in ",st.strom.s,sep="")

deals.strom <- read.odbc.ffdf(query = query,odbcConnect.args=list(dsn="dsn",uid="id",pwd="pwd"),
                       first.rows = 100000, next.rows = 500000, VERBOSE=TRUE)

result <- ffdfdply(deals.strom, deals.strom$DEALID, FUN=function(x){ 
  x <- split(x, x$DEALID)
  x <- lapply(x, FUN=function(onlyonedeal){
    onlyonedeal$Desidered_Col2 <- c(NA, -diff(onlyonedeal$STICHTAG))
    onlyonedeal
  })
  x <- do.call(rbind, x)      
  x
})
cat("Finished in",timetaken(started.at),"\n")

Here is the result of str(deals.strom[1:5,]):

'data.frame':   5 obs. of  39 variables:
 $ ABBREVIATION   : Factor w/ 33553 levels " C 251"," TÜV EE Donaustrom",..: 1893 1892 1894 1895 1896
 $ TRADEDATE      : POSIXct, format: "2007-06-19" "2007-06-19" "2007-06-19" ...
 $ BOOK           : Factor w/ 30 levels "CR_RIR_RISKRED",..: 10 10 10 10 10
 $ CONTRACT       : Factor w/ 20 levels "Base","DNULL",..: 1 5 5 1 1
 $ BUYSELL        : Factor w/ 2 levels "BUY","SELL": 2 1 2 1 1
 $ RATE           : num  54.2 57.2 57.3 54.2 55.1
 $ AMOUNT         : num  474792 501072 501773 474792 964476
 $ CUR            : Factor w/ 2 levels "EUR","USD": 1 1 1 1 1
 $ VOLUME         : num  8760 8760 8760 8760 17520
 $ UNIT           : Factor w/ 2 levels "MWH","t": 1 1 1 1 1
 $ STARTDATE      : POSIXct, format: "2010-01-01" "2010-01-01" "2010-01-01" ...
 $ ENDDATE        : POSIXct, format: "2011-01-01" "2011-01-01" "2011-01-01" ...
 $ BROKERAGE      : num  0 0 0 0 175
 $ DV             : num  85078 -98218 98919 -85078 -185048
 $ REALIZED       : num  85078 -98218 98919 -85078 -185048
 $ PV             : num  0 0 0 0 0
 $ DV_DAY         : num  0 0 0 0 0
 $ DV_MONTH       : num  0 0 0 0 0
 $ DV_YEAR        : num  0 0 0 0 0
 $ TRADER         : Factor w/ 16 levels "Adolf Plentz",..: 7 7 7 7 12
 $ ACTIVE         : Factor w/ 2 levels "LONGTERM","SHORTTERM": 2 2 2 2 2
 $ STATUS         : Factor w/ 2 levels "GCPTY","INT": 1 1 2 2 1
 $ PV_MIN         : num  0 0 0 0 0
 $ PV_PLUS        : num  0 0 0 0 0
 $ VERTRAGSPARTY  : Factor w/ 21 levels "EDL_G059","EDL_G097",..: 10 10 3 3 10
 $ GESELLSCHAFT   : Factor w/ 1 level "24/7 Trading": 1 1 1 1 1
 $ COMMODITY      : Factor w/ 5 levels "CASH","CO2","COAL",..: 4 4 4 4 4
 $ TO_BE_DELIVERED: num  0 0 0 0 0
 $ ACCOUNT        : Factor w/ 8 levels "CR_RISKRED","HO_COAL",..: 5 5 5 5 5
 $ VERW_PREIS     : num  0 0 0 0 0
 $ PV_ND          : num  0 0 0 0 0
 $ BILANZIERUNG   : Factor w/ 2 levels "JA","NEIN": 1 1 1 1 1
 $ MOTIV          : Factor w/ 8 levels "Emissionszertifikate",..: 4 4 4 4 4
 $ STICHTAG       : POSIXct, format: "2012-11-13" "2012-11-13" "2012-11-13" ...
 $ DEALID         : Factor w/ 59704 levels "FUX.E.EEX.K.20090622.002",..: 7175 7103 12584 12500 17985
 $ COUNTERPARTY   : Factor w/ 174 levels "24sieben GmbH",..: 171 171 53 53 141
 $ COMMODITY2     : Factor w/ 8 levels "CASH","CER","COAL",..: 4 4 4 4 4
 $ MARKTGEBIET    : Factor w/ 3 levels "Kohle","Strom",..: 2 2 2 2 2
 $ INSTRUMENT     : Factor w/ 88 levels "-","Elektrizität FUX EEX Base Apr11 EEXFUT",..: 1 1 1 1 1

My attempt after Jan's hint, which is not working:

test <- as.ffdf(deals.strom[,c("DEALID","STICHTAG","PV")])
test <- transform(test,chg=c(NA,diff(PV)),chg2=c(NA,-diff(PV)))
fdd <- as.ff(!duplicated(test$DEALID))
test[fdd,c("chg","chg2")] <- test[fdd,"PV"]

I get the following error message: "error: is.null(rownames(x)) is not TRUE". Somehow I cannot manage to subset the ffdf.
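
For clarity, the logic this attempt aims at can be shown on an ordinary in-memory data.frame: take the difference to the previous row, then overwrite the first row of each deal. This is a sketch with made-up values; it illustrates the intent only and does not address the ffdf subsetting error itself.

```r
# Toy stand-in for the three columns used above (values are made up)
test <- data.frame(
  DEALID   = c("A", "A", "A", "B", "B"),
  STICHTAG = as.Date(c("2012-11-13", "2012-11-14", "2012-11-15",
                       "2012-11-13", "2012-11-14")),
  PV       = c(5, 7, 4, 10, 13),
  stringsAsFactors = FALSE
)

# The trick only works if rows are ordered by deal, then by date
test <- test[order(test$DEALID, test$STICHTAG), ]

# Difference to the previous row, ignoring deal boundaries for now
test$chg <- c(NA, diff(test$PV))

# The first row of each deal has no predecessor within that deal:
# overwrite it with the PV itself (the same convention as diff.padded)
first <- !duplicated(test$DEALID)
test$chg[first] <- test$PV[first]
print(test)
```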

Solution

Have you tried ffdfdply in package ffbase? For an example of how to use it, see the question "R language: problems computing "group by" or split with ff package".

So in your case, do something like the following (I'm freewheeling here based on your example script, but you should get the point of split-apply-combine in an ffdf setting):

require(ffbase)
result <- ffdfdply(deals[c("Deal","Trade_Date")], deals$Deal, FUN=function(x){ 
  x$Deal <- as.character(x$Deal)
  x <- split(x, x$Deal)
  x <- lapply(x, FUN=function(onlyonedeal){
    onlyonedeal$Desidered_Col2 <- c(NA, -diff(onlyonedeal$Trade_Date))
    onlyonedeal
  })
  x <- do.call(rbind, x)      
  x
})
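
To see what the function passed as FUN does to a single chunk, here is a small in-memory sketch of the same split-apply-combine steps. It differences MktValue, as asked in the question, rather than the date; the names per_deal and delta are assumptions introduced for illustration, and the chunk is assumed to be a plain data.frame containing complete deals.

```r
# One chunk as FUN might receive it: a plain data.frame with complete deals
chunk <- data.frame(
  Deal     = c("Deal1", "Deal2", "Deal1", "Deal2", "Deal3"),
  MktValue = c(10, 21, 12, 19, 2),
  stringsAsFactors = FALSE
)

per_deal <- function(x) {
  pieces <- split(x, x$Deal)                        # split by deal
  pieces <- lapply(pieces, function(onlyonedeal) {  # apply per deal
    onlyonedeal$delta <- c(NA, diff(onlyonedeal$MktValue))
    onlyonedeal
  })
  out <- do.call(rbind, pieces)                     # combine
  rownames(out) <- NULL
  out
}

res <- per_deal(chunk)
print(res)
```

Note that the combined result comes back grouped by deal, so the row order differs from the input.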

Another solution would be the following, which doesn't use split-apply-rbind explicitly inside FUN:

require(ffbase)
require(doBy)
result <- ffdfdply(deals[c("DEALID","STICHTAG")], deals$DEALID, FUN=function(x){ 
  x <- orderBy(~ DEALID + STICHTAG, data = x)
  x$Desidered_Col2 <- c(NA, -diff(as.Date(x$STICHTAG)))
  firstdealdate <- !duplicated(x$DEALID)
  x$Desidered_Col2[firstdealdate] <- NA
  x
})

Other suggestions

Hi, I found the following solution. It works, but I would appreciate a more elegant one. I'm still forced to use objects in RAM, and I am worried that if the data size increases I'll have to process the data in pieces (which is an even less elegant solution). The data is stored in an ffdf file. I have circa 21 million rows and 39 columns.

deals # ffdf with 21Mio. rows and 39 columns
deals <- ffdfsort(deals)

deals <- transform(deals, delta_MktValue=0)
diff.padded <- function(x) c(x[1],diff(x))
delta <- data.table(deals[,c("Deal","Trade_Date","MktValue")])

diff <- delta[,diff.padded(MktValue),by=Deal]

deals[,"delta_MktValue"] <- diff[,V1]

rm(diff, delta)  # delta_PV is never created above, so it is not removed here
gc()

It does work, but I would appreciate it if someone could suggest a more elegant solution. In particular, I would like to perform the calculation directly in the ffdf. Thanks!
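
If the data does have to be processed in pieces, one way is to batch by deal so that all rows of a deal land in the same piece. Below is a minimal in-memory sketch of that idea with made-up values; batch_size is an arbitrary illustrative parameter. As far as I understand, ffdfdply is meant to handle this kind of batching for ffdf objects, but that is an assumption to check against its documentation.

```r
# Toy data standing in for the large table (values are made up)
deals <- data.frame(
  Deal     = c("D1", "D2", "D3", "D1", "D2", "D3", "D1"),
  MktValue = c(10, 21, 2, 12, 19, 5, 15),
  stringsAsFactors = FALSE
)

diff.padded <- function(x) c(x[1], diff(x))

# Assign each deal to a batch; batch_size is an arbitrary illustrative value
batch_size <- 2
ids        <- unique(deals$Deal)
batch_of   <- ceiling(seq_along(ids) / batch_size)
names(batch_of) <- ids

deals$delta_MktValue <- NA_real_
for (b in unique(batch_of)) {
  # Rows of all deals in this batch; every deal is complete within its batch
  rows <- which(deals$Deal %in% names(batch_of)[batch_of == b])
  deals$delta_MktValue[rows] <- ave(deals$MktValue[rows], deals$Deal[rows],
                                    FUN = diff.padded)
}
print(deals)
```

Only the rows of one batch are touched per iteration, and the results are written back by row position, so the original row order is preserved.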

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow