R data.table merge / full outer join with na.fill / nomatch based on formula

https://stackoverflow.com/questions/18314790

25-06-2022
|

Question

What I need is to perform full outer join with some kind of smart na.fill / nomatch in a efficient way. I've already done it using loop but I would like to use matrix algebra or data.table operations to speed up the process.

Data below are sample of stock open orders information, full outer join is performed between datasets of asks open orders and bids open orders. A dataset are asks, B are bids. Both datasets stores atomic orders and their cumulative sums. The task is to match all ask orders with bid orders by cumulative value and vice versa. Populate example data:

price = c(11.25,11.26,11.35,12.5,14.2)
amount = c(1.2,0.4,2.75,6.5,15.2)
A <- data.table(ask_price = price, ask_amount = amount, ask_cum_amount = cumsum(amount), cum_value = cumsum(price*amount), ask_avg_price = cumsum(price*amount)/cumsum(amount))
price = c(11.18,11.1,10.55,10.25,9.7)
amount = c(0.15,0.6,10.2,3.5,12)
B <- data.table(bid_price = price, bid_amount = amount, bid_cum_amount = cumsum(amount), cum_value = cumsum(price*amount), bid_avg_price = cumsum(price*amount)/cumsum(amount))

regular full outer join and it's results:

setkey(A, cum_value)
setkey(B, cum_value)
C <- merge(A,B,all=TRUE)
print(C)

na.fill / nomatch pseudocode formula, for every row (ask or bid) where cum_value not matches (please keep in mind that every other field than cum_value is related to ask OR bid):

avg_price["current NA"] <- cum_value["last non NA"]/cum_value["current NA"] * avg_price["last non NA"] + (1-cum_value["last non NA"]/cum_value["current NA"]) * price["next non NA"]
cum_amount["current NA"] <- cum_value["current NA"] / avg_price["current NA"]

expected results:

D <- data.table(
  cum_value = c(1.677,8.337,13.5,18.004,49.2165,115.947,130.4665,151.822,268.222,346.3065),
  ask_price = c(NA,NA,11.25,11.26,11.35,NA,12.5,NA,NA,14.2),
  ask_amount = c(NA,NA,1.2,0.4,2.75,NA,6.5,NA,NA,15.2),
  ask_cum_amount = c(0.149066666666667,0.741066666666667,1.2,1.6,4.35,9.66496172396059,10.85,12.3126600707381,20.4097766460076,26.05),
  ask_avg_price = c(11.25,11.25,11.25,11.2525,11.31414,11.9966331281534,12.02456,12.3305605066459,13.1418390633132,13.29392),
  bid_price = c(11.18,11.1,NA,NA,NA,10.55,NA,10.25,9.7,NA),
  bid_amount = c(0.15,0.6,NA,NA,NA,10.2,NA,3.5,12,NA),
  bid_cum_amount = c(0.15,0.75,1.23858478466587,1.66517233847558,4.6230572556498,10.95,12.3652404387114,14.45,26.45,NA),
  bid_avg_price = c(11.18,11.116,10.8995364444444,10.8120940902022,10.6458772362927,10.58877,10.5510685899445,10.50671,10.14072,NA)
)
print(D)

Note that in the expected results the last NA is still as NA, this is because opposite order could not be matched because the market depth is not enough to fulfill the order at any price.

Is it possible to get expected results using matrix algebra or data.table operations or any other efficient way to avoid looping over full dataset?

Thanks in advance

Solution

Merge it back again with A and B with a roll to find the last/next non-NA prices.

E.g. see the output values of bid_avg_price for these two merges:

B[merge(A, B, all = T), roll = Inf]
B[merge(A, B, all = T), roll = -Inf]

That should give you all the info you need to compute those quantities.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow