Question

I used Python to run a multi-factor ANOVA on a data set. I first fitted a model with ols(...).fit() and then called the anova_lm function. I noticed that every variable I am analyzing has 1 degree of freedom. Does that mean only one value from my data is extracted and used in the calculation? And why is the residual df so high?

import pandas as pd
from statsmodels.multivariate.manova import MANOVA
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.graphics.factorplots import interaction_plot
import matplotlib.pyplot as plt
from scipy import stats

#Some quick df transformation
#We want to analyze the candy sales numbers with respect to the flavors

formula = 'Sales ~ Mango+Raspberry+Chocolate+Dark_Chocolate+Ice_cream+Cherry' 

model = ols(formula, df).fit() 

aov_table = anova_lm(model, typ=2)

print(aov_table)

****ANOVA Results****

                      df    sum_sq   mean_sq          F    PR(>F)
Mango                1.0  0.008512  0.008512   2.325284  0.130999
Raspberry            1.0  0.006025  0.006025   1.645954  0.202998
Chocolate            1.0  0.049506  0.049506  13.524418  0.000412
Dark_Chocolate       1.0  0.007233  0.007233   1.976095  0.163447
Ice_cream            1.0  0.018032  0.018032   4.926093  0.029117
Cherry               1.0  0.024460  0.024460   6.682116  0.011444
Residual            85.0  0.311140  0.003660        NaN       NaN

Solution

Having the data set at hand would have made it easier to run this myself and explain it better.

Still, let me try to get you started, assuming you want help interpreting the results rather than solving a model-, platform-, or library-specific implementation problem. If not, there is no need to read on.

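As I understand it, you are fitting an ordinary linear model with ols and then asking anova_lm for its ANOVA table (with six predictors this is not really a two-way ANOVA). A degree of freedom of 1 does not mean that only one value from your data was used: every observation contributes to every row of the table. The df of a term counts how many parameters the model estimates for it. Each numeric (or two-level) predictor gets a single coefficient, so it uses 1 df, and six such predictors use 6 df. A categorical factor with k levels would use k - 1. The residual df is what is left over: the number of observations minus the number of estimated parameters, including the intercept. A residual df of 85 with six predictors and an intercept therefore implies that 92 rows were used in the fit (92 - 6 - 1 = 85), which is why it looks so high. Since mean_sq = sum_sq / df, the two columns are identical for the 1-df terms and differ for the residual (0.311140 / 85 ≈ 0.003660).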
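The degrees-of-freedom bookkeeping can be checked on made-up data. The following is only a minimal sketch: the column names are borrowed from the question, but the 92 rows are random numbers I generate here, not the asker's data, and I assume all six predictors are numeric. The mean square is computed by hand from sum_sq and df rather than read from the table.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for the real data (assumption): 92 rows, six numeric predictors.
rng = np.random.default_rng(0)
flavors = ["Mango", "Raspberry", "Chocolate", "Dark_Chocolate", "Ice_cream", "Cherry"]
n_obs = 92
df = pd.DataFrame(rng.normal(size=(n_obs, len(flavors))), columns=flavors)
df["Sales"] = 0.5 * df["Chocolate"] + rng.normal(size=n_obs)

# Same kind of model as in the question: Sales regressed on all six flavors.
model = ols("Sales ~ " + " + ".join(flavors), data=df).fit()
aov_table = anova_lm(model, typ=2)

# One slope is estimated per numeric predictor, so each term uses 1 df.
term_df = aov_table.loc[flavors, "df"]

# Residual df = observations - slopes - intercept.
resid_df = aov_table.loc["Residual", "df"]
expected_resid_df = n_obs - len(flavors) - 1

# Mean square is sum of squares divided by df, so it equals sum_sq when df is 1.
mean_sq = aov_table["sum_sq"] / aov_table["df"]

print(aov_table)
print("df per term:", term_df.tolist())
print("residual df:", resid_df, "expected:", expected_resid_df)
print("rows used in the fit:", int(model.nobs))
```

All 92 rows enter the fit (model.nobs), even though each term only "costs" one degree of freedom. Wrapping a column in C(...) in the formula would treat it as categorical, and its df would then be the number of levels minus one.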

I found it much easier to try a two-way ANOVA in Excel first and interpret the results there. Please try these links, which have an excellent explanation I found a while back:

https://statisticsbyjim.com/anova/two-way-anova-excel/

https://statisticsbyjim.com/hypothesis-testing/degrees-freedom-statistics/

A few more links that explain this clearly:

https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/anova/how-to/two-way-anova/methods-and-formulas/methods-and-formulas/

https://www.graphpad.com/support/faq/the-anova-table-ss-and-df-in-two-way-anova/

https://people.richland.edu/james/ictcm/2004/twoway.html

And if you are statistically challenged like me and need to brush up to use this effectively in ML, watch this:

https://www.youtube.com/watch?v=NF5_btOaCig

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange