What is a better approach for cross-validation with time-related predictors

https://datascience.stackexchange.com/questions/15436

16-10-2019

Question

I was given a data set with different predictors about a store, and the goal is to forecast the number of daily shoppers. The predictors are the weekday, time of day (morning, afternoon, evening), week number, month, weather (humidity, dew point, temperature), and holidays. The outcome variable is the number of visitors.

I want to build a regression model to predict the number of visitors using traditional machine learning algorithms such as random forests, SVM, and the like.

My main concern is how to validate this model using CV, since some of the predictors are time-related and plain vanilla CV cannot be applied here. In this question, they suggest a way to do this, but my problem is that I only have data from June 2015 to the present.

My initial idea was the following:

a. Train with data from June 2015 to December 2015. Test with January 2016.
b. Train with data from June 2015 to January 2016. Test with February 2016.

Each time, one month of data is added to the training data after the error for that month has been assessed. Then compute the average performance.

My questions:

1. Is this approach reasonable or not?
2. If so, should I get rid of the month variable? Note that in a., for instance, I am testing with data from a different month than the ones used for training: I train on data from June to December 2015 but test on January 2016. Seasonality may be something I am missing.
3. How should such models be validated in general?

Solution

One way to handle time series cross-validation is shown in the Python code below, taken from here:

import numpy as np

def performTimeSeriesCV(X_train, y_train, number_folds, algorithm, parameters):
    """
    Given X_train and y_train (the test set is excluded from the cross-validation),
    the number of folds, the ML algorithm to use and the parameters to test,
    this function splits X_train and y_train into number_folds folds. It then
    trains on the earlier folds and tests accuracy on the next one:
    - Train on fold 1, test on 2
    - Train on folds 1-2, test on 3
    - Train on folds 1-2-3, test on 4
    ....
    Returns the mean of the test accuracies.
    """

    print('Parameters --------------------------------> ', parameters)
    print('Size train set: ', X_train.shape)

    # k is the size of each fold: the number of rows in X_train divided by
    # number_folds, floored and coerced to int
    k = int(np.floor(float(X_train.shape[0]) / number_folds))
    print('Size of each fold: ', k)

    # initialize the accuracies array to zero. In time series CV, n folds give
    # only n-1 test folds, because the first one is always needed for training
    accuracies = np.zeros(number_folds - 1)

    # loop from the first 2 folds to the total number of folds
    for i in range(2, number_folds + 1):
        print('')

        # the first i folds are split into train and test at the fraction (i-1)/i.
        # For example, when i = 2 we take the first 2 folds and split them in
        # half (train on the first, test on the second), so the fraction is
        # 1/2 = 50%. When i = 3 we take the first 3 folds and split them at
        # 2/3 = 66% (train on the first 2, test on the third)

        # example with i = 4 (first 4 folds):
        #      Splitting the first       4        chunks at          3      /        4
        print('Splitting the first ' + str(i) + ' chunks at ' + str(i - 1) + '/' + str(i))

        # as we loop over the folds, X and y grow in size. This is the data
        # that is going to be split. If k = 300, with i starting from 2:
        # i = 2
        # X = X_train[:(600)]
        # y = y_train[:(600)]
        #
        # i = 3
        # X = X_train[:(900)]
        # y = y_train[:(900)]
        # ....
        X = X_train[:(k * i)]
        y = y_train[:(k * i)]
        print('Size of train + test: ', X.shape)  # the number of rows is k*i

        # X and y contain both the folds to train on and the fold to test on.
        # index is the row at which to split them. It equals
        # X.shape[0] * (i-1)/i, computed with integers to avoid float rounding
        index = k * (i - 1)

        # folds used to train the model
        X_trainFolds = X[:index]
        y_trainFolds = y[:index]

        # fold used to test the model (starts at index so that no row is skipped)
        X_testFold = X[index:]
        y_testFold = y[index:]

        # i starts from 2, so the matching position in the accuracies array is i-2.
        # performClassification() is a function, defined elsewhere, which takes
        # care of a classification problem. This is only an example and you can
        # replace this function with whatever ML approach you need.
        accuracies[i - 2] = performClassification(X_trainFolds, y_trainFolds, X_testFold, y_testFold, algorithm, parameters)

        # example with i = 4:
        #      Accuracy on fold         4     :    0.85423
        print('Accuracy on fold ' + str(i) + ': ', accuracies[i - 2])

    # the function returns the mean of the accuracy on the n-1 folds
    return accuracies.mean()

If, on the other hand, you prefer R, you can explore the timeslice method in the caret package with the following code:

library(caret) 
library(ggplot2) 
data(economics) 
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,
                              horizon = 12,
                              fixedWindow = TRUE)

plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = economics,
                    method = "pls",
                    preProc = c("center", "scale"),
                    trControl = myTimeControl)
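For readers who do not use R, the idea behind initialWindow, horizon and fixedWindow can be sketched in a few lines of Python. The helper time_slices below is hypothetical and is not part of caret; it only imitates the rolling-origin scheme as I understand it, with 0-based indices, and caret's own slicing may differ in details.

```python
def time_slices(n_rows, initial_window, horizon, fixed_window=True):
    """Yield (train_indices, test_indices) pairs for rolling-origin evaluation.

    Hypothetical helper, not part of caret: it only imitates the idea of
    initialWindow / horizon / fixedWindow using 0-based Python ranges.
    """
    last_offset = n_rows - initial_window - horizon
    for offset in range(last_offset + 1):
        train_end = initial_window + offset
        # Fixed window: the training start slides forward with the origin.
        # Otherwise the training set always starts at row 0 and keeps growing.
        train_start = offset if fixed_window else 0
        yield range(train_start, train_end), range(train_end, train_end + horizon)


# 60 monthly rows, 36 for the first training window, 12 to test on each time.
slices = list(time_slices(n_rows=60, initial_window=36, horizon=12))
first_train, first_test = slices[0]
print(len(slices), first_train, first_test)
```

With fixed_window=True the training set always has the same length, so old observations are dropped as the origin moves. With fixed_window=False it behaves like the expanding scheme in the Python function above.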

Other tips

Your approach 1) is correct: use n ordered observations to predict observation n+1. You will have to identify the right window for prediction, and choose a model that is not too flexible if you feel you have little data.
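The month-by-month scheme from the question can be written down as a small split generator. This is only a sketch: the function name monthly_expanding_splits, the "YYYY-MM" labels and the toy data are assumptions made for illustration, and the labels must sort chronologically as strings.

```python
def monthly_expanding_splits(months, min_train_months):
    """Yield (test_month, train_idx, test_idx) for month-by-month evaluation.

    `months` holds one label per row, e.g. "2015-06". Labels are assumed to
    sort chronologically as strings (ISO-style year-month).
    """
    ordered = sorted(set(months))
    for test_month in ordered[min_train_months:]:
        # Train on every row from earlier months, test on the current month.
        train_idx = [i for i, m in enumerate(months) if m < test_month]
        test_idx = [i for i, m in enumerate(months) if m == test_month]
        yield test_month, train_idx, test_idx


# Toy data: two rows per month from June 2015 to February 2016.
labels = ["2015-06", "2015-07", "2015-08", "2015-09", "2015-10",
          "2015-11", "2015-12", "2016-01", "2016-02"]
months = [label for label in labels for _ in range(2)]

for test_month, train_idx, test_idx in monthly_expanding_splits(months, 7):
    print(test_month, len(train_idx), len(test_idx))
```

Setting min_train_months to 7 reproduces the first split in the question (June to December 2015 for training, January 2016 for testing). Each later split adds one more month to the training rows.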

Do not forget about feature engineering and data preparation. The seasonality you mention can be removed if you identify it correctly.
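One simple way to remove a seasonal pattern, shown here for a weekday effect, is to estimate the average per weekday on the training rows and subtract it everywhere. This is a minimal sketch under that assumption, not the only way to deseasonalize; the function names and the toy numbers are made up for illustration.

```python
from collections import defaultdict


def weekday_means(weekdays, visitors):
    """Average visitor count per weekday, estimated from training rows only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, value in zip(weekdays, visitors):
        totals[day] += value
        counts[day] += 1
    return {day: totals[day] / counts[day] for day in totals}


def remove_weekday_effect(weekdays, visitors, means, default=0.0):
    """Subtract the learned weekday mean; unseen weekdays use `default`."""
    return [value - means.get(day, default)
            for day, value in zip(weekdays, visitors)]


train_days = ["Mon", "Sat", "Mon", "Sat"]
train_visitors = [100, 300, 120, 320]
means = weekday_means(train_days, train_visitors)

# The test fold is adjusted with the training means, never with its own.
test_residuals = remove_weekday_effect(["Mon", "Sat"], [130, 290], means)
print(means, test_residuals)
```

Estimating the means inside each training fold, rather than once on the full data set, keeps information from the test period out of the model.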

Validation of the model is done as usual, with a squared loss function.
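As a sketch of what that looks like across time-ordered folds, the squared loss can be computed per test fold and then averaged. The name squared_loss and the per-fold numbers below are hypothetical and stand in for the output of whatever model is being validated.

```python
def squared_loss(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical per-fold results: (actual visitors, predicted visitors).
folds = [
    ([120, 80, 100], [110, 90, 100]),
    ([150, 130], [140, 150]),
]
fold_errors = [squared_loss(actual, predicted) for actual, predicted in folds]
average_error = sum(fold_errors) / len(fold_errors)
print(fold_errors, average_error)
```

Keeping the per-fold errors, and not only their average, also shows whether the error drifts as the test month moves forward in time.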

License: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange