I am currently having performance issues with one of my scripts. I wrote the script as a result of this question, but I have been unable to improve its performance, and I figured that speeding it up is a different question from writing the code in the first place.
I wrote the code to generate a dummy webshop dataset with a hidden pattern that can be found with clustering, as an example for one of my courses. However, it does not let me go beyond ~40,000 transactions with a reasonable runtime (i.e. a few hours).
The issue is as follows: using these parameters, I build a transaction/customer/product table:
set.seed(1) # Set seed to make reproducible
Parameters <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  PropCustTypes = c(.10, .45, .30, .15),       # Probability for being in each group.
  BySearchEngine = c(0.10, .40, 0.50, 0.6),    # Probability for each group
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05), # of coming through channel X
  ByPartnerBlog = c(0.30, .30, 0.35, 0.35),    #
  Timeliness = c(1, 6, 12, 12),   # Average # of months between purchase & releasedate.
  Discount = c(0, 0, 0.05, 0.10), # Average Discount incurred when purchasing.
  stringsAsFactors = FALSE)
# Some other parameters for later use.
NumDays = 1000
NumTransactions = 100000 # A starting point; more than this will be generated once weekend increases, annual growth etc. are applied.
SalesMultiplierWeekends = 1.5 # For example, I want more in weekends
StartDate <- as.Date("2009-01-04")
NumProducts <- 150
AnnualGrowth <- .1 # I also want an annual growth trend
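As a quick sanity check on the Parameters table above (illustrative only, not part of the generator), each customer type's three channel probabilities should sum to 1:

rowSums(Parameters[, c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")]) # each should be 1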
I start with a 'days' data frame that divides the total number of transactions almost evenly over all days.
days <- data.frame( # Define the days
  day = StartDate + 1:NumDays,
  DaysSinceStart = StartDate + 1:NumDays - StartDate, # Used to compute a rising trend
  CustomerRate = NumTransactions/NumDays)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
  as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] * SalesMultiplierWeekends)
days$nPurchases <- as.integer(days$nPurchases + days$nPurchases * (days$DaysSinceStart/365) * AnnualGrowth)
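To check that the weekend multiplier and the growth trend behave as intended (a throwaway check, not part of the script; IsWeekend is a helper I only use here):

IsWeekend <- as.POSIXlt(days$day)$wday %in% c(0, 6)
mean(days$nPurchases[IsWeekend]) / mean(days$nPurchases[!IsWeekend]) # should be near 1.5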
Next I generate the transactions using this table:
Transactions <- data.frame(
  ID = 1:sum(days$nPurchases),
  Date = rep(days$day, times = days$nPurchases),
  CustomerType = sample(Parameters$CustomerType, sum(days$nPurchases),
                        replace = TRUE, prob = Parameters$PropCustTypes),
  NewCustomer = sample(c(0,1), sum(days$nPurchases), replace = TRUE,
                       prob = c(.8, .2)), # 80% existing, 20% new customers
  CustomerID = NA, # Will be assigned later
  ProductID = NA,  # Will be assigned later
  ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)
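A quick way to verify that the sampled customer types follow PropCustTypes (again just a check):

round(table(Transactions$CustomerType) / nrow(Transactions), 2) # should be close to .10/.45/.30/.15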
Now I'd like to dynamically assign products and customers to each transaction so that my pattern becomes recognizable in the transaction dataset. First I build a product table to choose from, with convenient release dates, so that I can later select a product for each transaction based on this information.
StartProductRelease <- StartDate - (365*2*max(Parameters$Timeliness)/12)
ReleaseRange <- StartProductRelease + c(1:(StartDate + NumDays - StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
  ID = 1:NumProducts,
  DateReleased = as.Date(StartProductRelease + c(seq(as.numeric(Upper-Lower)/NumProducts,
                                                     as.numeric(Upper-Lower),
                                                     as.numeric(Upper-Lower)/NumProducts))),
  SuggestedPrice = rnorm(NumProducts, 100, 50))
Products[Products$SuggestedPrice < 10, ]$SuggestedPrice <- 15 # Floor the price: anything below $10 becomes $15
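To confirm that the release dates span the whole transaction window plus the lead time implied by Timeliness (illustrative check):

range(Products$DateReleased)     # should run from before StartDate up to StartDate + NumDays
summary(Products$SuggestedPrice) # centered near 100, floored at 15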
Next I build a table of customers, deriving its size from the number of 'new customers' in the transaction dataset.
Customers <- data.frame(
  ID = 1:sum(Transactions$NewCustomer),
  CustomerType = sample(Parameters$CustomerType, size = sum(Transactions$NewCustomer),
                        replace = TRUE, prob = Parameters$PropCustTypes))
Customers$CustomerType <- as.character(Customers$CustomerType)
I want to dynamically assign Customers and Products to each transaction, sampled from the 'Products' and 'Customers' data frames, in order to maintain the overall parameters I defined above. I'd like to vectorize this, but I have no idea how (I've already moved as much as I could out of the for loop). The part outside of the for loop:
ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")
Transactions <- merge(Transactions, Parameters, by = "CustomerType") # Parameters are now
Transactions$Discount <- rnorm(length(Transactions$ID),              # assigned to each transaction
                               Transactions$Discount, Transactions$Discount/20)
Transactions$Timeliness <- rnorm(length(Transactions$ID),
                                 Transactions$Timeliness, Transactions$Timeliness/6)
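Note that merge() sorts its result by the 'by' column instead of preserving row order, so I restore the ID order afterwards:

Transactions <- Transactions[order(Transactions$ID), ] # merge() sorted by CustomerType; restore ID order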
Now the performance issues start to arise, in the for loop:
for (i in 1:nrow(Transactions)){
  # Only sample customers which share the same 'CustomerType' as the transaction
  Transactions[i,]$CustomerID <- sample(
    Customers[Customers$CustomerType == Transactions[i,]$CustomerType, ]$ID, 1, replace=FALSE)
  # Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
  Transactions[i,]$ReferredBy <- sample(ReferredByOptions, 1, replace=FALSE,
    prob = Transactions[i, c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
  # Only sample products in the required range to maintain the 'timeliness' parameter
  CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
  ProductReleaseRange <- as.Date(CenteredAround + c(-15:15))
  Transactions[i,]$ProductID <- sample(
    Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange), ]$ID, 1, replace=FALSE)
}
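I suspect the row-by-row Transactions[i,]$... <- assignments are a large part of the problem, since each one copies the whole data frame. A minimal toy benchmark (made-up data, separate from my script) shows the pattern:

n <- 5000
df <- data.frame(x = numeric(n), y = numeric(n)) # toy frame, two columns
system.time(for (i in 1:n) df[i, ]$x <- i)       # row-by-row data.frame assignment: slow
system.time({v <- numeric(n)                     # fill a plain vector instead,
             for (i in 1:n) v[i] <- i            # then assign it once: fast
             df$x <- v})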
This brings me to my final question: how would I vectorize this last part? I've been able to munge millions of rows with data.table in seconds, so it seems strange that such a relatively simple task runs this slowly. Some timings:
- For loop / filling 100 rows: ~18 seconds
- For loop / filling 200 rows: ~37 seconds
- For loop / filling 1000 rows: ~3 minutes
- For loop / filling 300000 rows: no idea, I never got that far.
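For what it's worth, for the ReferredBy column I can imagine sampling per CustomerType group, since the channel probabilities are constant within a group; an untested sketch is below. But I don't see how to apply the same idea to the date-dependent product sampling:

for (ct in Parameters$CustomerType) { # only four groups, so effectively vectorized
  idx <- which(Transactions$CustomerType == ct)
  p <- unlist(Parameters[Parameters$CustomerType == ct,
                         c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
  Transactions$ReferredBy[idx] <- sample(ReferredByOptions, length(idx),
                                         replace = TRUE, prob = p)
}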
Why is this running so slowly, and how can I solve it? Any help is greatly appreciated.