Python Statsmodels Testing Coefficients from Robust Linear Model based on M-Estimators

StackOverflow https://stackoverflow.com/questions/21538519

  •  06-10-2022

Question

I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. I've read that RLMs based on M-estimators are good in this situation.

When I fit an RLM to my data in the following way:

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# item is a categorical variable
modelspec = 'cost ~ np.log(units) + np.log(units):item + item'
results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
print(results.summary())

the summary results show a z statistic, and the coefficient significance tests appear to be based on this rather than on a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.

Two questions:

  1. Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?

  2. All terms and interactions are highly significant in the results (|z| > 4). In most cases, each item has 40 or more observations, but some items have only 21-25. Is there reason to believe that RLM is not effective in a small-sample environment? The line it produces must be the best-fit line after reweighting outliers, but is the z-test effective for samples of this size? That is, is there a reason to believe the confidence interval produced by smf.rlm() does not achieve 95% coverage? I know this can potentially be an issue even for t-tests.

Thanks!


Solution

I have mostly only a general answer; I have never read any small-sample Monte Carlo studies for M-estimators.

To 1.

In many models, such as M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except maybe for a few special cases. The asymptotic results give conditions under which the estimator is asymptotically normally distributed. Given this, statsmodels defaults to the normal distribution for all models outside of the linear regression model (OLS) and similar, and to the chi-square instead of the F distribution for Wald tests of joint hypotheses.
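
As a minimal illustration (assuming the fitted RLM results object from the question), the z statistic reported in the summary is just the coefficient divided by its standard error, and the two-sided p-value comes from the standard normal distribution:

import numpy as np
from scipy import stats

# Sketch, assuming `results` is the fitted RLM results object from the question.
# The summary's z column is params / bse, and P>|z| uses the standard normal
# distribution (the asymptotic approximation).
z = results.params / results.bse
p_normal = 2 * stats.norm.sf(np.abs(z))

print(z)         # should match results.tvalues
print(p_normal)  # should match results.pvalues under the default normal assumption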

There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.

In the next release of statsmodels, and in the current development version, users can choose to use the t and F distributions for the results instead of the normal and chi-square distributions. The defaults stay the same as they are now.
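
As an illustration of the difference (a hand-rolled sketch, not the built-in option itself), the p-values can be recomputed under a t distribution, using the residual degrees of freedom as one reasonable choice:

import numpy as np
import pandas as pd
from scipy import stats

# Sketch: compare normal-based and t-based two-sided p-values for the same
# statistics from the RLM fit in the question.  df_resid = nobs - k_params
# is one reasonable (but not the only possible) choice of degrees of freedom.
z = results.params / results.bse
df = results.df_resid

comparison = pd.DataFrame({
    "z": z,
    "p_normal": 2 * stats.norm.sf(np.abs(z)),
    "p_t": 2 * stats.t.sf(np.abs(z), df),
})
print(comparison)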

There are other cases where it is not clear whether the t distribution should be used at all, and, if so, with which small-sample degrees of freedom. In many cases, statsmodels tries to follow the lead of Stata, for example for cluster-robust standard errors after OLS. Another consequence is that equivalent models that are special cases of different model classes sometimes use different default distributional assumptions, both in Stata and in statsmodels.

I recently read the SAS documentation for M-estimators; SAS uses the chi-square distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.

To 2.

(see first sentence)

I think the same considerations as for linear models apply here. If the data are highly non-normal, then the test statistics could have incorrect coverage in small samples. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we do not use heteroscedasticity- or correlation-robust covariance estimators, then the tests can also be strongly biased.

For robust estimation methods like M-estimators (RLM), the effective sample size also depends on the number of inliers, or on the weights assigned to the observations, not just on the total number of observations.
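
As a rough illustrative check (assuming the TukeyBiweight fit from the question), the robustness weights of the fitted RLM can be inspected directly; observations with weights near zero are effectively being discarded, and the sum of the weights gives a crude sense of the effective sample size:

import numpy as np

# Sketch: inspect the robustness weights from the RLM fit in the question.
# Observations with weights near zero are effectively discarded by the
# Tukey biweight norm; the sum of the weights is a crude proxy for the
# effective sample size.
w = np.asarray(results.weights)
print("nobs:", w.size)
print("sum of weights (rough effective n):", w.sum())
print("observations with weight < 0.5:", (w < 0.5).sum())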

For your case, I think the z-values and the sample sizes are large enough that, for example, using the t distribution would not make the results much less significant. Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness to the assumptions about the outliers and to the choice of robust estimator. Another cross-check: does OLS with the outliers dropped (the observations that get small weights in the RLM estimate) give a similar answer?
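
A minimal sketch of that cross-check, assuming the dataset, modelspec, and RLM results from the question and assuming no rows were dropped for missing values; the 0.1 cutoff for a "small" weight is arbitrary and only for illustration:

import numpy as np
import statsmodels.formula.api as smf

# Sketch of the suggested cross-check: drop the observations that the RLM
# down-weights heavily and refit by OLS, then compare the coefficients.
# The 0.1 cutoff is arbitrary and only for illustration.
keep = np.asarray(results.weights) > 0.1
ols_results = smf.ols(modelspec, data=dataset[keep]).fit()

print(results.params)
print(ols_results.params)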

Finally, as a general caution: the references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on the "inliers". But is discarding or down-weighting the outliers justified? Or do we have missing non-linearities, missing variables, a mixture distribution, or different regimes?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow