Question

I am currently having issues with performance in one of my scripts. I made the script as a result of this question, but I have been unable to increase its performance and figured increasing its performance is a different question than actually writing the code.

I wrote the code to generate a dummy webshop dataset with a hidden pattern hat can be found with clustering as an example in one of my courses. It, however, does not allow me to go beyond ~ 40,000 transactions with a reasonable runtime (i.e. a few hours).

This issue is as follows, using these parameters I will build a transaction/customer/product table:

set.seed(1)   # Set seed to make reproducible
Parameters <- data.frame(
  CustomerType  = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  PropCustTypes = c(.10, .45, .30, .15),   # Probability for being in each group.
  BySearchEngine   = c(0.10, .40, 0.50, 0.6),   # Probability for each group  
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),  # of coming through channel X
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35), #
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
  stringsAsFactors=FALSE)
# Some other parameters for later use.
NumDays = 1000
NumTransactions = 100000 # Note that more than these will be made, it's a starting point (excluding annual growth, weekend increases etc.)
SalesMultiplierWeekends = 1.5 # For example, I want more in weekends
StartDate <- as.Date("2009-01-04")
NumProducts <- 150
AnnualGrowth <- .1 # I also want an annual growth trend

I start with a 'Days' dataframe along with an almost equal division of total transactions over all days.

days <- data.frame( # Define the days
  day            = StartDate+1:NumDays, 
  DaysSinceStart = StartDate+1:NumDays - StartDate, # Used to compute a rising trend
  CustomerRate = NumTransactions/NumDays)    

days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <-  # Increase sales in weekends
  as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*SalesMultiplierWeekends)
days$nPurchases <- as.integer(days$nPurchases+days$nPurchases * (days$DaysSinceStart/365)*AnnualGrowth)

Next I generate the transactions using this table:

Transactions <- data.frame(
  ID           = 1:sum(days$nPurchases),
  Date         = rep(days$day, times=days$nPurchases),
  CustomerType = sample(Parameters$CustomerType, sum(days$nPurchases), replace=TRUE, prob=Parameters$PropCustTypes),
  NewCustomer  = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
  CustomerID   = NA, # Will be assigned later, NewCustomer: 0.8 and .2 
  ProductID = NA,    # insinuate new/existing customers above
  ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)

Now I'd like to dynamically assign products and customers to each transaction in order to make my pattern recognizable in the transaction dataset. I first make a product table from which I can choose, having convenient release dates so that I will be able to select a product for each transaction based on this info.

StartProductRelease <- StartDate-(365*2*max(Parameters$Timeliness)/12)
ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
  ID = 1:NumProducts,
  DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
                                         as.numeric(Upper-Lower),
                                         as.numeric(Upper-Lower)/NumProducts))),
  SuggestedPrice = rnorm(NumProducts, 100, 50))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 15   # Cap ProductPrice at 10$

Next I build a table of customers, deriving from the number of 'new customers' in the transaction dataset.

Customers <- data.frame(
  ID=(1:sum(Transactions$NewCustomer)), 
  CustomerType = sample(Parameters$CustomerType, size=sum(Transactions$NewCustomer),
                        replace=TRUE, prob=Parameters$PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)

I want to dynamically assign Customers and Products to each transaction, sampled from the 'Products' and 'Customers' dataframe in order to maintain the overall parameters I have defined above. I'd like to vectorize this, but I have no idea on how I would do so (I've already excluded as much as I could from the for loop). The part outside of the for loop:

ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")
Transactions <- merge(Transactions,Parameters, by="CustomerType") # Parameters are now
Transactions$Discount <- rnorm(length(Transactions$ID),           # assigned to each transaction
                               Transactions$Discount,Transactions$Discount/20)
Transactions$Timeliness <- rnorm(length(Transactions$ID),
                               Transactions$Timeliness, Transactions$Timeliness/6)

Now the performance issues start to arise, the for loop:

for (i in 1:nrow(Transactions)){
# Only sample customers which share the same 'CustomerType' as the transaction
  Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
                                          1,replace=FALSE)

# Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
  Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
                                    prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
# Only sample products in the required range to maintain the 'timeliness' parameter.
  CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
  ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
  Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}

This concludes to my final question: how would I vectorize the last part here? I've been able to munge millions of rows with data.table in seconds, it just seems weird that I'm unable to conduct such a relatively simple task so slow.

  • For loop / filling 100 rows: ~ 18 Seconds
  • For loop / filling 200 rows: ~ 37 Seconds
  • For loop / filling 1000 rows: ~ 3 minutes
  • For loop / filling 300000 rows: No idea, can't get that far?

Why is it running so slow and how can I solve this? Any help is greatly appreciated.

Was it helpful?

Solution

Below is how you would do the first part using data.table, adding CustomerID to the Transactions table. I have changed some names and dropped the placeholder columns as they will be added through the data.table joins.

Tr <- data.table(Transactions)
Tr[, CustomerID:=NULL]
Tr[, ProductID:=NULL]
Tr[, ReferredBy:=NULL]  ## see @Arun's comment for a more compact way to do this

Cs <- data.table(Customers)
setnames(Cs, 'ID', 'CustomerID')  ## So we avoid duplicate with Tr

## Add customer ID, matching customer types
setkey(Tr, CustomerType)
setkey(Cs, CustomerType)

# Make an index Transaction ID -> Customer ID
# Large interim matrix should not be formed, but I am not sure
TrID2CustID <- Cs[Tr, allow.cartesian=T][, list(CustomerID=sample(CustomerID, 1)), by=ID]
setkey(TrID2CustID, ID)
setkey(Tr, ID)
Tr <- Tr[TrID2CustID]

There is a large matrix that is the cartesian product of your Transactions and Customers tables (about 15M rows) which would exhaust the memory if it is explicitly computed. Judging by the fact that this takes about a second, I'd say it is not computed, but I am not sure.

I will work on the rest and edit the answer if I come up with the solutions quickly, but this ought to show you how to do this using data.table.

UPDATE 1: adding ReferredBy

Since the referral probabilities only vary by CustomerType, you can generate the referrals in blocks with replacement (much faster than by individual ID)

setkey(Tr, CustomerType)
Tr[, ReferredBy:=sample(ReferredByOptions, replace=TRUE, size=.N,
                        prob=c(BySearchEngine[1], 
                               ByDirectCustomer[1],
                               ByPartnerBlog[1])),
   by=CustomerType]

UPDATE 2: adding ProductID

This is proving trickier to do in a neat cartesian-product sort of way. I cannot think of an elegant way to generate the 31 dates (-15:15) for each purchase (melted matrix would probably be too big). The code below works as intended but is not as fast as the previous 2:

Pr <- data.table(Products)
setnames(Pr, 'ID', 'ProductID')    ## not necessary here, but good practice
CenteredAround <- as.Date(Tr$Date - 30*Tr$Timeliness)

setkey(Tr, ID)
Tr[, ProductID:=sample(Pr[abs(Pr$DateReleased - 
                              CenteredAround[.I]) <= 15, ProductID], 1), by=ID]

OTHER TIPS

A very simple optimization is to avoid modifying the data frame in the loop, as others have suggested. At least prior to R3.1, modifying a data frame is really expensive, so that's the last thing you want to be doing in a loop. Also, based on Hadley's comments and release notes for R3.1, it may be the case that modifying data frames is not as expensive with R3.1, but I haven't tested.

Here we get around the data frame modification by storing interim results in vectors, and then only inserting into the data frame after the loop. Consider:

system.time({
  custId <- Transactions$CustomerID
  refBy <- Transactions$ReferredBy
  productID <- Transactions$ProductID

  for (i in 1:100){
    # Only sample customers which share the same 'CustomerType' as the transaction
    custId <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
                     1,replace=FALSE)

    # Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
    refBy <- sample(ReferredByOptions,1,replace=FALSE,
                    prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
    # Only sample products in the required range to maintain the 'timeliness' parameter.
    CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
    ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
    productID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
  }
  Transactions$CustomerID <- custId
  Transactions$ReferredBy <- refBy
  Transactions$ProductID <- productID      
})

Which times in at:

user  system elapsed 
0.66    0.06    0.71 

The corresponding time with your original code is:

user  system elapsed 
5.01    1.78    6.79 

So close to a 10x improvement with a minor change (avoiding modifying the data frame repeatedly).

I'm sure you can get further improvements, but this is a real low hanging fruit you can easily implement.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top