Conditionally matching elements in multiple columns of two large datasets with each other

https://stackoverflow.com/questions/22092269

18-10-2022
|

Pergunta

I have two very large datasets for demand and returns of products (about 4 million entries per dataset, but unequal length). The first dataset gives [1] the date of demand, [2] the id of the customer and [3] the id of the product. The second dataset gives the [1] date of return, [2] the id of the customer and [3] the id of the product.

Now I would like to match all demands for given customers and products with the returns of the same customer and product. Pairs of product types and customers are not unique, because customer can demand a product multiple times. Therefore, I want to match a demand for a product with the earliest return in the dataset. It can also happen that some products are not returned, or that some products are returned which have not been demanded (because customers return items that were demanded before the starting data in the dataset).

To that end I've written the following code:

transactionNumber = 1:nrow(demandSet)  #transaction numbers for the demandSet
matchedNumber = rep(0, nrow(demandSet)) #vector of which values in the returnSet correspond to the transactions in the demandSet

for (transaction in transactionNumber){
indices <- which(returnSet[,2]==demandSet[transaction,2]&returnSet[,3]==demandSet[transaction,3]) 
if (length(indices)>0){
    matchedNumber[transaction] <- indices[which.min(returnSet[indices,][,1])] #Select the index of the transaction with the minimum date
} 
}

However, this takes around a day to compute. Anyone have a better suggestion? Note that the suggestions from match two columns with two other columns do not work here, since match() overflows memory.

As a working example consider

demandDates = c(1,1,1,5,6,6,8,8)
demandCustIds = c(1,1,1,2,3,3,1,1)
demandProdIds = c(1,2,3,4,1,5,2,6)
demandSet = data.frame(demandDates,demandCustIds,demandProdIds)

returnDates = c(1,1,4,4,4)
returnCustIds = c(4,4,1,1,1)
returnProdIds = c(5,7,1,2,3)
returnSet = data.frame(returnDates,returnCustIds,returnProdIds)

(This actually doesn't work completely correctly, since transaction 7 is incorrectly matched with return 4, however for the sake of the question lets assume this I what I want... I can fix this later)

Solução

require(data.table)

DD<-data.table(demandSet,key="demandCustIds,demandProdIds")
DR<-data.table(returnSet,key="returnCustIds,returnProdIds")
DD[DR,mult="first"]   

   demandCustIds demandProdIds demandDates returnDates
1:             1             1           1           4
2:             1             2           1           4
3:             1             3           1           4
4:             4             5          NA           1
5:             4             7          NA           1

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow