This is described on page 12 of the paper:
The combined-feature detectors above are monolithic – they concatenate the motion and appearance features into a single large feature vector and train a combined classifier on it.
So you simply build one feature vector by concatenating the two descriptors. The other possibility mentioned is a Mixture of Experts:
In our experiments these effects mitigate the losses due to separate training and the linear Mixture of Experts classifier actually performs slightly better than the best monolithic detector. For now the differences are marginal (less than 1%), but the Mixture of Experts architecture provides more flexibility and may ultimately be preferable. The component classifiers could also be combined in a more sophisticated way, for example using a rejection cascade [1, 22, 21] to improve the runtime.
You can read about this method, for example, here.
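To make the difference between the two options concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the function names, the toy descriptors, and all weights are made up for illustration, and I am assuming that the "linear Mixture of Experts" amounts to a linear combination of the scores of two separately trained linear classifiers.

```python
import numpy as np

def concatenate_descriptors(appearance, motion):
    """Monolithic option: join both descriptors into one feature vector."""
    return np.concatenate([np.ravel(appearance), np.ravel(motion)])

def linear_score(weights, bias, features):
    """Score of a linear classifier (e.g. a trained linear SVM)."""
    return float(np.dot(weights, features) + bias)

def combine_expert_scores(score_appearance, score_motion, mix_weights, mix_bias):
    """Assumed linear combination of the two component classifier scores."""
    return mix_weights[0] * score_appearance + mix_weights[1] * score_motion + mix_bias

# Toy descriptors (hypothetical values; real ones would have thousands of dimensions).
appearance = np.array([0.2, 0.5, 0.1])
motion = np.array([0.7, 0.3])

# Option 1: one combined vector, to be fed to a single classifier.
combined = concatenate_descriptors(appearance, motion)

# Option 2: score each descriptor with its own classifier, then mix the scores.
# All weights below are placeholders standing in for trained parameters.
score_app = linear_score(np.array([1.0, -1.0, 0.5]), 0.0, appearance)
score_mot = linear_score(np.array([2.0, 1.0]), 0.0, motion)
final_score = combine_expert_scores(score_app, score_mot, (0.5, 0.5), 0.0)

print(combined.shape, final_score)
```

In the monolithic case you train one classifier on `combined`. In the second case each component classifier only ever sees its own descriptor, and only the small mixing step sees both scores.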