Question

I'm trying to build a predictive model with a customer database.

I have a dataset with 3,000 customers. In the test dataset, each customer has 300 observations and 20 variables (including the dependent variable). I also have a score dataset with 50 observations and 19 variables (excluding the dependent variable) for each unique customer ID. The test dataset is in a separate file, with each customer identified by a unique ID variable; the score dataset is likewise identified by a unique ID variable.

I'm developing a randomForest-based predictive model. Below is a sample for a single customer. I'm not sure how to automatically fit the model for each customer, make the predictions, and store the models efficiently.

    install.packages("randomForest")
    library(randomForest)
    sales <- read.csv("C:/rdata/test.csv", header=T)
    sales_score <- read.csv("C:/rdata/score.csv", header=T)

    ## randomForest for a single customer

    sales.rf <- randomForest(Sales ~ ., ntree = 500, data = sales,importance=TRUE)
    sales.rf.test <- predict(sales.rf, sales_score)

I am very familiar with SAS and am beginning to learn R. For SAS programmers, many SAS procedures come with BY-group processing, for example:

    proc gam data = test;
    by id;
    model y = x1  x2 x3;
    score data = test  out = pred;
    run;

This SAS program fits a GAM model for each unique ID and applies it to the test set for each unique ID. Is there an R equivalent?

I would greatly appreciate any examples or thoughts.

Thanks so much


Solution

Assuming your sales dataset is 3,000 * 300 = 900,000 rows and both dataframes have a customer_id column, you can do something like:

pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
# pred_groups is now a list whose names are the customer_ids and whose
# elements are integer vectors of row numbers in sales_score. Now iterate
# over each customer and make predictions on that customer's score rows.
preds <- unsplit(structure(lapply(names(pred_groups), function(customer_id) {
  # Train using only observations for this customer.
  # Note we are comparing character to integer but R's natural type
  # coercion should still give the correct answer.
  train_rows <- sales$customer_id == customer_id
  sales.rf <- randomForest(Sales ~ ., ntree = 500,
                           data = sales[train_rows, ],importance=TRUE)

  # Now make predictions only for this customer.
  predict(sales.rf, sales_score[pred_groups[[customer_id]], ])
}), .Names = names(pred_groups)), sales_score$customer_id)

print(head(preds)) # Should now be a vector of predicted values whose length
  # is the number of rows in the score set (sales_score).
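
To see what split() and unsplit() are doing in the code above, here is a minimal sketch on a made-up five-row table. The `score` data frame and the group-mean "prediction" are hypothetical stand-ins, not part of the original data; the point is only that the per-group results end up back in the original row order.

```r
# Hypothetical toy score table: three customers, rows deliberately interleaved.
score <- data.frame(customer_id = c(2, 1, 2, 3, 1),
                    x = c(10, 20, 30, 40, 50))

# Row numbers grouped by customer; the list names are the ids as strings.
groups <- split(seq_len(nrow(score)), score$customer_id)
str(groups)

# Stand-in for per-customer predictions: the group mean of x, once per row.
per_group <- lapply(groups, function(rows) {
  rep(mean(score$x[rows]), length(rows))
})

# unsplit() puts each group's values back at their original row positions.
restored <- unsplit(per_group, score$customer_id)
print(restored)
```

Because the result lines up with the rows of the score table, it can be attached directly as a new column, for example `score$pred <- restored`.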

Edit: Per @joran, here is a solution with a for loop:

pred_groups <- split(seq_len(nrow(sales_score)), sales_score$customer_id)
preds <- numeric(nrow(sales_score))
for(customer_id in names(pred_groups)) {
  train_rows <- sales$customer_id == customer_id
  sales.rf <- randomForest(Sales ~ ., ntree = 500,
                           data = sales[train_rows, ],importance=TRUE)
  pred_rows <- pred_groups[[customer_id]]
  preds[pred_rows] <- predict(sales.rf, sales_score[pred_rows, ])
}
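
The question also asks about storing the models, which the loop above does not do (sales.rf is overwritten on each pass). One possible approach, sketched below on made-up data, is to keep the fitted models in a list named by customer id and save that list with saveRDS(). To keep the sketch runnable in base R, lm() is used as a stand-in for randomForest(); the toy data, `models`, and `model_file` are assumptions for illustration only.

```r
# Toy data: hypothetical stand-ins for the sales and sales_score tables.
set.seed(1)
sales <- data.frame(customer_id = rep(1:3, each = 20), x1 = rnorm(60))
sales$Sales <- 2 * sales$x1 + sales$customer_id + rnorm(60, sd = 0.1)
sales_score <- data.frame(customer_id = rep(1:3, times = 2), x1 = rnorm(6))

# Fit one model per customer and keep them in a list named by customer id.
# lm() is only a stand-in here; the randomForest() call from the answer
# could be swapped in. customer_id is dropped from the predictors because
# it is constant within each group.
train_groups <- split(sales, sales$customer_id)
models <- lapply(train_groups, function(d) {
  lm(Sales ~ ., data = d[, names(d) != "customer_id"])
})

# Score each customer's rows with that customer's own model.
preds <- numeric(nrow(sales_score))
for (id in names(models)) {
  rows <- which(sales_score$customer_id == id)
  preds[rows] <- predict(models[[id]], sales_score[rows, ])
}

# Persist all fitted models in one file and reload them later.
model_file <- tempfile(fileext = ".rds")
saveRDS(models, model_file)
models_reloaded <- readRDS(model_file)
```

With 3,000 customers and 500 trees per forest, a single list of models may get large. If memory becomes a problem, saving each model to its own file inside the loop and discarding it afterwards is an alternative.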
Licensed under: CC-BY-SA with attribution