Question

I have a big data problem with a large dataset (take for example 50 million rows and 200 columns). The dataset consists of about 100 numerical columns and 100 categorical columns and a response column that represents a binary class problem. The cardinality of each of the categorical columns is less than 50.

I want to know a priori whether I should go for deep learning methods or ensemble tree-based methods (for example gradient boosting, AdaBoost, or random forests). Are there exploratory data analysis techniques, or other methods, that can help me decide on one method over the other?

Solution

Why restrict yourself to those two approaches? Because they're cool? I would always start with a simple linear classifier/regressor, so in this case a linear SVM or logistic regression, preferably with an implementation that can take advantage of sparsity given the size of the data. It will take a long time to run a DL algorithm on that dataset, and I would normally only try deep learning on specialist problems where there's some hierarchical structure in the data, such as images or text. It's overkill for a lot of simpler learning problems, takes a lot of time and expertise to learn, and DL algorithms are very slow to train.

Additionally, just because you have 50M rows doesn't mean you need the entire dataset to get good results. Depending on the data, you may get good results with a sample of a few hundred thousand rows or a few million. I would start simple, with a small sample and a linear classifier, and get more complicated from there if the results are not satisfactory. At least that way you'll get a baseline. We've often found simple linear models to outperform more sophisticated models on most tasks, so you always want to start there.

OTHER TIPS

In addition to the other answers (and there are some good links in the comments), it depends on what the problem is and what kinds of questions you want to answer. Speaking only from my own experience: for a classification task, the viable methods can be severely limited by the class balance in the dataset.

Once you go beyond roughly a 1:10 class imbalance, most classification methods simply stop working. You'll be left with methods based on random forests and maybe neural nets (which I haven't tried yet). I work with class balances in the range of 1:500 to 1:1000 and have found that neither down- nor upsampling works. Luckily my dataset is "only" 6 million observations by 200 variables, and I'm able to run boosted trees on the whole set in reasonable time.
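As a sketch of one imbalance-aware alternative to resampling (an assumption on my part, not something the answer prescribes): scikit-learn's tree ensembles accept `class_weight="balanced"`, which reweights the minority class instead of down- or upsampling. The toy data below uses a roughly 1:50 class ratio.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Toy binary target with roughly a 1:50 class imbalance.
n = 10_000
y = (rng.rand(n) < 1 / 51).astype(int)
X = rng.randn(n, 5) + y[:, None] * 1.5  # minority class is shifted

ratio = (y == 1).sum() / (y == 0).sum()  # minority:majority ratio

# class_weight="balanced" reweights samples instead of resampling rows.
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, y)
```

At the 1:500 to 1:1000 ratios mentioned above, reweighting alone may not be enough either; it is simply a cheap first thing to try.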

So to directly answer your question:

  • come up with the set of questions you want to answer and, for each classification task, check the class balance of the target variable.

  • check the distribution (not in the mathematical sense) of missing values across all of your data and document what you find. Some ML methods are fine with missing values while others are not, in which case you need to look into data imputation (which has its own set of rules, guidelines, and problems).
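The missing-value audit in the second point can be a one-liner with pandas. A minimal sketch on a toy frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame mixing numeric and categorical columns.
df = pd.DataFrame({
    "num_a": rng.normal(size=1_000),
    "num_b": rng.normal(size=1_000),
    "cat_a": rng.choice(["x", "y", "z"], size=1_000),
})
# Knock out 20% of one column to simulate missingness.
df.loc[df.sample(frac=0.2, random_state=0).index, "num_b"] = np.nan

# Fraction of missing values per column, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)
```

Documenting this table per column is usually enough to decide which methods can ingest the data as-is and which will need imputation first.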

From my perspective, with 5 million instances you need lots of trees to get a good generalization bound (a good model, in layman's terms). If this is not a problem then go for it, though the exact answer depends on the nature of your problem. GBT is a good method, especially if you have mixed feature types such as categorical, numerical, and so on. In addition, compared to neural networks it has a smaller number of hyperparameters to tune, so it is faster to arrive at a well-tuned model. Parallel training is one more option: you can train multiple trees at the same time with a good CPU. If you are not satisfied with the results, then go for neural nets, since that implies your model should be more expressive and should learn higher-order structure from your data. That is the strength of NNs compared to other learning algorithms.

On the lines of what @Simon has already said:

  1. Deep learning approaches have been particularly useful in solving problems in vision, speech and language modeling where feature engineering is tricky and takes a lot of effort.
  2. For your application that does not seem to be the case since you have well defined features and only feature interactions etc. are required.
  3. Given that deep learning models currently need a lot of computing resources and scientist time in coding stuff up I'd suggest opting for a non-deep learning approach.

For your problem the effort-versus-benefit tradeoff does not seem to be in deep learning's favour; DL would be overkill.

When you have such a large dataset you can play with any of the statistical and machine learning modelling techniques, and that is highly encouraged. As others have suggested, I would also recommend taking a few million random samples from the data and experimenting with those. Since this is a classification problem, I would try simple classification techniques first and move on to more complex ones later. Logistic regression is a great place to start.
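A sketch of that subsample-then-baseline workflow, assuming scikit-learn (the data is synthetic and stands in for the full 50M-row table). A stratified draw keeps the class proportions of the full data intact in the sample:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Stand-in for the full dataset; in practice this would be ~50M rows.
n_full = 200_000
X_full = rng.randn(n_full, 20)
y_full = (X_full[:, 0] + 0.5 * X_full[:, 1]
          + rng.randn(n_full) > 0).astype(int)

# Draw a stratified subsample so class proportions are preserved.
sample_idx = train_test_split(
    np.arange(n_full), train_size=50_000,
    stratify=y_full, random_state=0)[0]
X, y = X_full[sample_idx], y_full[sample_idx]

clf = LogisticRegression(max_iter=1000).fit(X, y)
sample_accuracy = clf.score(X, y)
```

If the score on a sample this size is already acceptable, fitting anything heavier on the full 50M rows may never be necessary.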

I want to add that generative models should also be tried. The Naive Bayes classifier is one of the simplest probabilistic classifiers, and it outperforms many complex methods like support vector machines on many tasks. You can look at this simple implementation of NB and this link for a comparison of NB with logistic regression.

One can build a Naive Bayes (NB) classifier as a baseline model and then move on to any machine learning technique such as support vector machines (SVMs) or multilayer perceptrons (MLPs). The trade-off is that NB is computationally much cheaper than an MLP, so the MLP's performance has to be better to justify the extra cost.
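A minimal sketch of that comparison, assuming scikit-learn and synthetic data: fit the cheap generative baseline (`GaussianNB`) first, then a discriminative model, and compare cross-validated accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
n = 5_000
X = rng.randn(n, 10)
y = (X[:, 0] - X[:, 1] + 0.5 * rng.randn(n) > 0).astype(int)

# Cheap generative baseline first, then a discriminative model to beat it.
nb_acc = cross_val_score(GaussianNB(), X, y, cv=5).mean()
lr_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
```

If `lr_acc` (or later an SVM/MLP score) barely exceeds `nb_acc`, the extra training cost of the more expensive model is hard to justify.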

Coming to your exact query: deep learning and gradient tree boosting are very powerful techniques that can model almost any kind of relationship in the data. But what if, in your case, a simple logistic regression or NB already gives the desired accuracy? So it's always better to try the simple techniques first and establish a baseline performance. Then one can move to the more complex models and compare against the baseline.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange