Select reference level in y-variable / LHS / endogenous side using patsy

https://stackoverflow.com//questions/25055539

21-12-2019

Question

I'm trying to use Patsy to build the endogenous and exogenous data matrices for a binary logistic regression. I'm having problems setting the reference level on the endogenous side.

The problem with the following code is that the endogenous side comes out with two columns (one per level), where binary logistic regression needs only one.

import pandas as pd
import statsmodels.api as sm
import patsy

# data: iris from Rdatasets, restricted to two species
url = 'http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv'
df = pd.read_csv(url)
df = df.iloc[:, 1:]  # drop the leading row-index column, keep every row
df = df.loc[(df.Species == 'setosa') | (df.Species == 'versicolor')]
df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

# Species on the LHS, with 'versicolor' requested as the reference level
y, X = patsy.dmatrices("C(Species, Treatment('versicolor')) ~ Sepal_Length", data=df, return_type='dataframe')

The shape of y is (100, 2), but I only need one column. How do I get Patsy to output the endogenous side so that I can use it directly in binary logistic regression?

Solution

Hmm, my advice would be to slice into y after you do the above. Patsy isn't really designed with LHS variables in mind, so a categorical on the left-hand side is expanded into one indicator column per level. Statsmodels should handle this case; currently it doesn't, but that's a bug in statsmodels IMO. If you file a bug report on GitHub, I can look into it.
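To make the slicing concrete, here is a minimal sketch. It assumes a tiny made-up sample in place of the full iris subset (so it runs without a download), and it assumes that 'setosa' is the level you want coded as 1. The names y_binary and is_setosa are mine, not anything Patsy provides. It also shows an alternative that builds the 0/1 column in pandas first, which avoids the two-column output altogether.

```python
import pandas as pd
import patsy

# Made-up sample standing in for the two-species iris subset (assumption).
df = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9],
    "Species": ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "versicolor"],
})

y, X = patsy.dmatrices(
    "C(Species, Treatment('versicolor')) ~ Sepal_Length",
    data=df,
    return_type="dataframe",
)

# The LHS has one indicator column per level, so y has two columns here.
print(y.columns.tolist())

# Option 1: slice out the indicator for the non-reference level.
# Selecting by the "[setosa]" suffix avoids relying on column order.
setosa_col = [c for c in y.columns if c.endswith("[setosa]")][0]
y_binary = y[setosa_col]

# Option 2: build the 0/1 response in pandas and use it as a plain
# numeric LHS variable; Patsy then returns a single column.
df["is_setosa"] = (df["Species"] == "setosa").astype(int)
y_alt, X_alt = patsy.dmatrices(
    "is_setosa ~ Sepal_Length", data=df, return_type="dataframe"
)
```

Either y_binary or y_alt can then be passed as the endogenous variable to sm.Logit together with X. The toy sample above is perfectly separated, so it is only meant to show the shapes, not to be fitted.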

FYI, you can use

import statsmodels.api as sm
dta = sm.datasets.get_rdataset('iris', cache=True)
df = dta.data  # the dataset as a pandas DataFrame

as a shortcut to get to the Rdatasets data.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow