Question

I understand that when a model passed to a statsmodels fit contains a categorical variable, dummy variables are generated automatically for its categories. For example, if I have a variable 'Location' with values 'IndianOcean', 'Thailand', 'China' and 'Mars', I get variables in my model of the form

Location[T.Thailand]

with one of the values not represented. By default the excluded value seems to be the least common one. Is there a way to specify, ideally within the model specification, which value is treated as the "base value" and excluded?

Solution

You can pass a reference arg to the Treatment contrast, using syntax like

"y ~ C(Location, Treatment(reference='China'))"

http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Treatment

If you have a better suggestion for naming conventions, please file an issue with patsy.
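As a minimal sketch of the formula above, the snippet below builds the design matrix directly with patsy (the formula library statsmodels uses) and compares the generated column names. The data values and the choice of 'Mars' as the reference are made up for illustration. As far as I can tell, the default baseline for plain string data is the first level in sorted order, which is 'China' here.

```python
import patsy

# Toy data; the values are made up for illustration.
data = {
    "Location": ["IndianOcean", "Thailand", "China", "Mars", "China", "Thailand"],
}

# Default treatment coding: no reference given, so patsy picks the baseline.
default = patsy.dmatrix("C(Location)", data)

# Explicit baseline via Treatment(reference=...).
custom = patsy.dmatrix("C(Location, Treatment(reference='Mars'))", data)

# Column names show which level was dropped as the baseline.
default_columns = default.design_info.column_names
custom_columns = custom.design_info.column_names
print(default_columns)
print(custom_columns)
```

The same formula string can be passed to a statsmodels formula interface such as statsmodels.formula.api.ols, with the response on the left-hand side as in "y ~ C(Location, Treatment(reference='Mars'))".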

Other tips

If you wrap the formula string in single quotes, the argument to reference needs to be wrapped in double quotes. It is a very easy mistake to make: I was using single quotes for both.

For example:

'y ~ C(Location, Treatment(reference="China"))'

is correct.

'y ~ C(Location, Treatment(reference='China'))'

is not correct.
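This is ordinary Python string quoting rather than anything specific to patsy. A small sketch of the options that parse, plus a check that the single-quotes-everywhere version is rejected by Python itself (the variable names are only illustrative):

```python
# Three spellings that Python accepts as a single string literal.
double_inside = 'y ~ C(Location, Treatment(reference="China"))'
single_inside = "y ~ C(Location, Treatment(reference='China'))"
escaped = 'y ~ C(Location, Treatment(reference=\'China\'))'

# Escaping the inner quotes gives exactly the same string as swapping the outer ones.
same_string = escaped == single_inside

# The failing version, held as source text so that this script itself still runs.
broken_source = "'y ~ C(Location, Treatment(reference='China'))'"
try:
    compile(broken_source, "<formula>", "eval")
    broken_is_valid = True
except SyntaxError:
    # The inner quote ends the string early, leaving a bare name China.
    broken_is_valid = False

print(same_string, broken_is_valid)
```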

OK, maybe someone will find this one helpful. I needed to set a new baseline category for the dependent variable and had no idea how to do it. I searched and found nothing, so I simply added a "_" prefix to the other categories. If you have 3 categories A, B and C, and you want your baseline to be C, you just change the labels A and B to _A and _B. It works. It appears that the baseline category is determined by sorted().

Maybe someone knows a proper way to do it; this is not very Pythonic.
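To see why the renaming trick changes the baseline, assuming the baseline really is the first label in sorted() order as suggested above, compare the sort order before and after the rename. Uppercase letters sort before the underscore in ASCII, so C moves to the front:

```python
# Labels before and after the rename described above.
original_labels = ["A", "B", "C"]
renamed_labels = ["_A", "_B", "C"]

# Assumption from the answer: the baseline is the first label in sorted order.
baseline_before = sorted(original_labels)[0]
baseline_after = sorted(renamed_labels)[0]

print(baseline_before, baseline_after)
```

For categorical predictors in a formula, the Treatment(reference=...) syntax from the accepted answer avoids the need to rename labels at all.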

License: CC-BY-SA with attribution
Not affiliated with StackOverflow