Question

I'm coming from mainly working in R for statistical modeling / machine learning and looking to improve my skills in Python. I am wondering the best way to create a design matrix of categorical interactions (to arbitrary degree) in python.

A toy example:

import pandas as pd
from urllib.request import urlopen  # Python 3 location; Python 2 used `from urllib import urlopen`
page = urlopen("http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv")
df = pd.read_csv(page)
df.head(n=5)


Let's say we want to create interactions between Outlook, Temperature and Humidity. Is there an efficient way to do this? I can manually do something like this in pandas:

# pd.lib.fast_zip has been removed from pandas; groupby(...).ngroup() assigns
# one integer code per observed combination, in order of first appearance.
OutTempFact=df.groupby(['Outlook','Temperature'],sort=False).ngroup().rename('OutTemp')
OutHumFact=df.groupby(['Outlook','Humidity'],sort=False).ngroup().rename('OutHum')
TempHumFact=df.groupby(['Temperature','Humidity'],sort=False).ngroup().rename('TempHum')

IntFacts=pd.concat([OutTempFact,OutHumFact,TempHumFact],axis=1)
IntFacts.head(n=5)


which I could then pass to a scikit-learn one-hot encoder, but there is likely a much better, less manual way to create interactions between categorical variables without having to step through each combination.

import sklearn as sk
import sklearn.preprocessing  # submodules are not loaded by `import sklearn` alone
enc = sk.preprocessing.OneHotEncoder()
IntFacts_OH=enc.fit_transform(IntFacts)
IntFacts_OH.todense()

Solution

If you use the OneHotEncoder on your design matrix to obtain a one-hot design matrix, then interactions are nothing other than products of columns. If X_1hot is your one-hot design matrix as a dense NumPy array with samples as rows (OneHotEncoder returns a sparse matrix by default, so call .toarray() on it first), then for 2nd order interactions you can write:

X_2nd_order = (X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]).reshape(len(X_1hot), -1)

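The result contains every pair twice, because the product of columns i and j appears at both (i, j) and (j, i). It also contains the original features, because an indicator column multiplied by itself is unchanged.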
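
To make the duplication concrete, here is a small sketch on a made-up indicator matrix with 2 samples and 3 columns. The toy values and the names outer, diagonal and X_pairs are assumptions for illustration, not part of the original answer. It checks that the full product is symmetric, that its diagonal reproduces the original columns, and shows one possible way to keep each unordered pair once with np.triu_indices.

```python
import numpy as np

# Toy one-hot-style matrix: 2 samples, 3 indicator columns (made-up values).
X_1hot = np.array([[1., 0., 1.],
                   [0., 1., 1.]])
n_samples, n_features = X_1hot.shape

# Full outer product per sample: shape (n_samples, n_features, n_features).
outer = X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]

# Symmetric: entry (i, j) equals entry (j, i), so every pair appears twice.
is_symmetric = np.allclose(outer, outer.transpose(0, 2, 1))

# The diagonal (i == j) is x_i * x_i, which equals x_i for 0/1 indicators.
diagonal = np.einsum("nii->ni", outer)

# Keep each unordered pair i < j once (k=1 skips the diagonal).
rows, cols = np.triu_indices(n_features, k=1)
X_pairs = outer[:, rows, cols]
```

With 3 columns this leaves 3 pair columns instead of the 9 produced by the full reshape.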

Going to arbitrary order is going to make your design matrix explode: with d one-hot columns, the order-k product built this way has d^k columns. If you really want to do that, then you should look into kernelizing with a polynomial kernel, which will let you go to arbitrary degrees easily.
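
As a rough illustration of the kernel route, the sketch below uses scikit-learn's polynomial_kernel on a made-up indicator matrix. With gamma=1 and coef0=1, the degree-2 kernel equals (x·y + 1)^2, which implicitly accounts for all products of up to two features without ever building those columns. The toy data and the parameter choices are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Made-up indicator matrix: 3 samples, 4 one-hot columns.
X_1hot = np.array([[1., 0., 1., 0.],
                   [0., 1., 1., 0.],
                   [1., 0., 0., 1.]])

# Degree-2 polynomial kernel: K[a, b] = (gamma * <x_a, x_b> + coef0) ** degree.
K = polynomial_kernel(X_1hot, degree=2, gamma=1.0, coef0=1.0)

# The same values computed by hand from the Gram matrix.
K_manual = (X_1hot @ X_1hot.T + 1.0) ** 2
```

A matrix like K can be handed to estimators that accept precomputed kernels, such as SVC(kernel="precomputed"); passing kernel="poly" with a degree directly to SVC is the more usual shortcut.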

Using the data frame you present, we can proceed as follows. First, a manual way to construct a one-hot design out of the data frame:

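import numpy as np

indicators = []   # one indicator block per column of df
state_names = []  # "<column>__<level>" label for each indicator column
for column_name in df.columns:
    column = df[column_name].values
    levels = np.unique(column)  # sorted distinct values of this column
    # Compare each entry against every level: shape (n_samples, n_levels)
    one_hot = (column[:, np.newaxis] == levels).astype(float)
    indicators.append(one_hot)
    state_names.extend("%s__%s" % (column_name, level) for level in levels)

# Stack the per-column blocks side by side
X_1hot = np.hstack(indicators)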
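
A shorter route to the same kind of indicator matrix, if you are happy to stay in pandas, is pd.get_dummies. The sketch below uses a small made-up frame standing in for the tennis data (the rows are invented; only the column names echo the question), and prefix_sep="__" is chosen to mimic the state_names convention above. The names toy, X_1hot_alt and state_names_alt are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the tennis data (values are illustrative only).
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Rain", "Overcast", "Sunny"],
    "Humidity": ["High", "Normal", "High", "Normal"],
})

# One indicator column per (column, level) pair, named "<column>__<level>".
dummies = pd.get_dummies(toy, prefix_sep="__", dtype=float)

X_1hot_alt = dummies.to_numpy()          # dense indicator matrix
state_names_alt = list(dummies.columns)  # matching column names
```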

The column names are then stored in state_names and the indicator matrix in X_1hot. Next we calculate the second-order features:

X_2nd_order = (X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]).reshape(len(X_1hot), -1)

To get the names of the columns of the second-order matrix, we construct them in the same order:

from itertools import product
one_hot_interaction_names = ["%s___%s" % (column1, column2) 
                             for column1, column2 in product(state_names, state_names)]

OTHER TIPS

I faced a similar problem: I wanted an easy way to integrate specific interactions from a baseline OLS model in the literature, to compare against ML approaches. I came across patsy (http://patsy.readthedocs.io/en/latest/overview.html) and its scikit-learn integration, patsylearn (https://github.com/amueller/patsylearn).

Here is how the interaction variables could be passed to the model:

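from sklearn.linear_model import LinearRegression
from patsylearn import PatsyModel
# Q('...') quotes a column name that is not a valid Python identifier (here it contains a hyphen)
model = PatsyModel(LinearRegression(), "Q('Play-Tennis') ~ C(Outlook):C(Temperature) + C(Outlook):C(Humidity) + C(Outlook):C(Wind)")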

Note that in this formulation you don't need the OneHotEncoder(): the C in the formula tells the Patsy interpreter that these are categorical variables, and they are one-hot encoded for you. You can read more about it in the documentation (http://patsy.readthedocs.io/en/latest/categorical-coding.html).

Or, you could also use the PatsyTransformer, which I prefer, as it allows easy integration into scikit-learn Pipelines:

from patsylearn import PatsyTransformer
transformer = PatsyTransformer("C(Outlook):C(Temperature) + C(Outlook):C(Humidity) + C(Outlook):C(Wind)")
Licensed under: CC-BY-SA with attribution