Question

I have performed a linear regression analysis on two series of data, each of which has 50 values. I did the analysis in SPSS, and the output table says my adjusted R squared is 0.145 and its significance is 0.004.

Since 0.004 < 0.05, I assume my adjusted R squared is significant.

1) Does it mean my adjusted R squared is credible?

2) What happens if you get a significance that is > 0.05? Does it imply that the adjusted R squared can still be trusted, but that the two datasets are uncorrelated or only weakly correlated?


Solution

The p-value measures the strength of evidence against the null hypothesis. In this case the null is that the coefficient is equal to zero. A p-value as small as 0.004 is strong evidence against the null, so your model is likely describing a real relationship in the underlying data.

R-squared describes the percentage of variation that is explained by the model. Your value is quite low: 14.5%. Of all the "activity" in the data, your model explains only 14.5% of it.

So you have a situation where the model is most likely explaining real variation in the data, but not explaining very much of it. I would suggest altering the model and refitting.
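To make this concrete, here is a minimal sketch of the same kind of analysis in Python using `scipy.stats.linregress` (the synthetic data, seed, and variable names are illustrative, not the asker's actual data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50                                  # same sample size as in the question
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)        # weak true relationship plus noise

res = stats.linregress(x, y)
r2 = res.rvalue ** 2                    # share of variance explained
# adjusted R^2 with one predictor: 1 - (1 - R^2) * (n - 1) / (n - 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, p = {res.pvalue:.4f}")
```

With data like this you typically see exactly the situation described above: a small p-value (the relationship is real) alongside a modest R^2 (the relationship explains little of the variation).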

OTHER TIPS

The accepted answer is correct in interpreting R^2 as the proportion of variation in the dependent variable explained by the independent variable(s). It ranges from 0.0 (0%) to 1.0 (100%) of correlated variation (for a linear regression), so if it is 100%, all of the changes in Y (the dependent, or response, variable) can be attributed to changes in X (the independent, or predictor, variable). Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model constant. This statistical control that regression provides is important because it isolates the role of one variable from all of the others in the model.

The p-value indicates whether you can reject the null hypothesis. The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (say < 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable.

Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response.

You will never get p = 0 in a real scenario, because there is no way to rule out the null entirely: there will always be some error in the observations from measurement alone, leaving aside other factors. You need to establish how certain you need to be in order to feel comfortable. There is nothing magical about p = 0.05; it was established as a convention and is now treated as doctrine by many who do not understand it. If you can feel comfortable in your situation with an 80% certainty that the null can be rejected, then there is nothing wrong with that level.

The real reason I wanted to add an answer is that the other answer doesn't address your use of adjusted R^2, sometimes called R-bar squared. Adjusted R^2 is not R^2 and should not be confused with it. R^2 >= adjusted R^2, and if you are only dealing with the correlation between two variables, as hinted at in your question, you should use R^2. In simple linear regression, R^2 is the square of the correlation between the two variables. Adjusted R^2 includes additional factors that attempt to account for the phenomenon of R^2 automatically, and spuriously, increasing when extra explanatory variables are added to the model.
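For reference, the usual adjustment, for n observations and p explanatory variables, is:

adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)

Since the factor (n - 1)/(n - p - 1) is at least 1, this can only pull the value down from R^2, and it pulls harder as more predictors are added.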

Adjusted R^2 can be negative, while R^2 cannot. The adjusted R^2 value will always be less than or equal to that of R^2. Unlike R^2, adjusted R^2 increases only when the increase in R^2 (due to the inclusion of a new explanatory variable) is more than one would expect to see by chance. If a set of explanatory variables with a predetermined hierarchy of importance is introduced into a regression one at a time, with adjusted R^2 computed each time, the point at which adjusted R^2 reaches its maximum and begins decreasing marks the regression with the ideal combination: the best fit without excess or unnecessary terms.

Adjusted R^2 is particularly useful in the feature-selection stage of model building.
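The "add predictors one at a time and watch adjusted R^2" procedure described above can be sketched with plain NumPy; the data here is synthetic (two informative predictors plus two pure-noise ones), so the helper name and setup are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# columns 0 and 1 truly drive y; columns 2 and 3 are pure noise
X = rng.normal(size=(n, 4))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def r2_and_adjusted(X_sub, y):
    """OLS fit with intercept; return (R^2, adjusted R^2)."""
    n, p = X_sub.shape
    A = np.column_stack([np.ones(n), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

# introduce the predictors one at a time, in order of presumed importance
for k in range(1, 5):
    r2, adj = r2_and_adjusted(X[:, :k], y)
    print(f"{k} predictor(s): R^2 = {r2:.3f}, adjusted R^2 = {adj:.3f}")
```

Running something like this shows R^2 creeping upward with every added column, including the noise ones, while adjusted R^2 stops improving once the useless predictors come in; that peak is the stopping point the answer describes.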

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange