Question

I thought that generalized linear model (GLM) would be considered a statistical model, but a friend told me that some papers classify it as a machine learning technique. Which one is true (or more precise)? Any explanation would be appreciated.


Solution

A GLM is absolutely a statistical model, but statistical models and machine learning techniques are not mutually exclusive. In general, statistics is more concerned with inferring parameters, whereas in machine learning, prediction is the ultimate goal.
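
To make the distinction concrete, here is a minimal sketch, assuming Python with the statsmodels and scikit-learn libraries (neither is mentioned in the answers themselves), of the same logistic GLM used first with an inferential focus and then purely for out-of-sample prediction:

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Simulated data: two predictors and a binary outcome (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    p = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1])))
    y = rng.binomial(1, p)

    # Statistical view: fit the GLM and inspect the coefficients and their uncertainty.
    glm = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
    print(glm.summary())  # estimates, standard errors, p-values

    # Machine learning view: the same model judged only by held-out prediction accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))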

Other tips

Regarding prediction, statistics and machine learning started out solving largely the same problem, but from different perspectives.

Basically, statistics assumes that the data were produced by a given stochastic model. So, from a statistical perspective, a model is assumed, and under various assumptions the errors are handled and the model parameters and other quantities of interest are inferred.

Machine learning comes from a computer science perspective. The models are algorithmic, and usually very few assumptions are made about the data; instead, we work with a hypothesis space and a learning bias. The best exposition of machine learning I have found is Tom Mitchell's book Machine Learning.

For a more exhaustive and complete picture of the two cultures, you can read Leo Breiman's paper Statistical Modeling: The Two Cultures.

However, it must be added that even though the two sciences started from different perspectives, they now share a fair amount of common knowledge and techniques. Why? Because the problems were the same, but the tools were different. So machine learning is now mostly treated from a statistical perspective (see the Hastie, Tibshirani, and Friedman book The Elements of Statistical Learning, a machine learning treatment from a statistical point of view, and perhaps Kevin P. Murphy's book Machine Learning: A Probabilistic Perspective, to name just a few of the best books available today).

Even the history of this field's development shows the benefits of this merging of perspectives. I will describe two events.

The first is the creation of CART trees by Breiman, who had a solid statistical background. At approximately the same time, Quinlan developed the ID3, C4.5, See5, and related decision tree suite from a more computer-science background. Now both families of trees, together with ensemble methods like bagging and random forests, have become quite similar.

The second story is about boosting. Boosting algorithms were initially developed by Freund and Schapire when they discovered AdaBoost. The design choices for AdaBoost were made mostly from a computational perspective; even the authors did not fully understand why it works. Only five years later, Breiman (again!) described the AdaBoost model from a statistical perspective and explained why it works. Since then, various eminent scientists, with both types of background, have developed those ideas further, leading to a whole family of boosting algorithms: logistic boosting, gradient boosting, gentle boosting, and so on. It is hard now to think about boosting without a solid statistical background.

The generalized linear model is a statistical development. However, newer Bayesian treatments also place this algorithm in the machine learning playground. So I believe both claims could be right, since the interpretation and treatment of how it works can differ.
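
As one hedged illustration of that Bayesian angle (not taken from any of the answers above): placing a Gaussian prior on the GLM coefficients and taking the maximum a posteriori estimate corresponds, roughly, to the L2-regularized logistic regression found in machine learning libraries, e.g. scikit-learn:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Illustrative data for a logistic GLM with three predictors.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 3))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([1.0, -2.0, 0.5]))))

    # MAP estimate under a (roughly) Gaussian prior on the coefficients:
    # the L2 penalty plays the role of the prior, with C controlling its spread.
    map_glm = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
    print("MAP-style coefficients:", map_glm.coef_)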

In addition to Ben's answer, the subtle distinction between statistical models and machine learning models is that, in statistical models, you explicitly decide the structure of the output equation before building the model. The model is then built to compute the parameters/coefficients.

Take a linear model or GLM, for example:

y = a1x1 + a2x2 + a3x3

Your independent variables are x1, x2, x3, and the coefficients to be determined are a1, a2, a3. You define your equation structure this way before building the model and then compute a1, a2, a3. If you believe that y is somehow related to x2 in a non-linear way, you could try something like this:

y = a1x1 + a2(x2)^2 + a3x3.

Thus, you impose a restriction on the output structure. Statistical models are inherently linear unless you explicitly apply transformations, such as a sigmoid link or a kernel, to make them nonlinear (as in GLMs and SVMs).
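
As a minimal sketch of this point (with made-up data and plain least squares, purely for illustration), the analyst literally writes down the design matrix for each chosen structure, and only the coefficients are left to be estimated:

    import numpy as np

    # Illustrative data where y actually depends on x2 non-linearly.
    rng = np.random.default_rng(1)
    x1, x2, x3 = rng.normal(size=(3, 200))
    y = 2.0 * x1 + 0.5 * x2**2 - 1.0 * x3 + rng.normal(scale=0.1, size=200)

    # Structure 1: y = a1*x1 + a2*x2 + a3*x3
    X_lin = np.column_stack([x1, x2, x3])
    a_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

    # Structure 2: y = a1*x1 + a2*(x2)^2 + a3*x3 -- changing the structure means changing the design matrix.
    X_sq = np.column_stack([x1, x2**2, x3])
    a_sq, *_ = np.linalg.lstsq(X_sq, y, rcond=None)

    print("linear-in-x2 coefficients:   ", a_lin)
    print("quadratic-in-x2 coefficients:", a_sq)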

In the case of machine learning models, you rarely specify the output structure, and algorithms like decision trees are inherently non-linear and work efficiently.
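
For contrast, here is a hypothetical example of a decision tree picking up a non-linear relationship without the analyst ever writing down an output equation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # A non-linear target; no functional form is specified anywhere below.
    rng = np.random.default_rng(2)
    x = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(x).ravel() + rng.normal(scale=0.1, size=300)

    tree = DecisionTreeRegressor(max_depth=4).fit(x, y)
    print("training R^2:", tree.score(x, y))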

Contrary to what Ben pointed out, machine learning models aren't just about prediction; they perform classification, regression, and so on, which can be used to make predictions that are also made by various statistical models.

GLM is absolutely a statistical model, while more and more statistical methods are being applied in industrial production as machine learning techniques. Meta-analysis, which I have been reading about most these days, is a good example from the statistical field.

A real industrial application of GLM can explain why your friend told you that GLM was regarded as a machine learning technique. You can refer to the source paper http://www.kdd.org/kdd2016/papers/files/adf0562-zhangA.pdf about that.

A few weeks ago, I implemented a simplified version, which serves as the main framework for my recommendation system in a production scenario. I would much appreciate any tips; you can check the source code here: https://github.com/PayneJoe/algo-sensetime/blob/master/src/main/scala/GLMM.scala

Hope this helps you. Good day!

Licensed under: CC-BY-SA with attribution