Question

How do I get the maximum within a subset of my dataframe in Pandas?

For example, when I do something like

statedata[statedata['state.region'] == 'Northeast'].ix[statedata['Murder'].idxmax()]

I get a KeyError that indicates that idxmax is returning the key for the global maximum, Alabama, rather than the maximum within the queried subset (from which that key is of course missing).

Is there a way to do this concisely in Pandas?


For reference, the data used here is from R, using

data(state)
statedata = cbind(data.frame(state.x77), state.abb, state.area, state.center, state.division, state.name, state.region)

then exported from R and imported by Pandas.


Solution

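Filter first, then call idxmax on the filtered result, so the returned key is guaranteed to exist in the subset. You could use df.loc to select the Murder column for just the Northeast rows: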

import pandas as pd
import pandas.rpy.common as com  # note: pandas.rpy was removed in pandas 0.20, so this needs an older pandas
import rpy2.robjects as ro

r = ro.r
statedata = r('''cbind(data.frame(state.x77), state.abb, state.area, state.center,
                 state.division, state.name, state.region)''')
df = com.convert_robj(statedata)
df.columns = df.columns.to_series().str.replace('state.', '')
subdf = df.loc[df['region']=='Northeast', 'Murder']
print(subdf)
# Connecticut       3.1
# Maine             2.7
# Massachusetts     3.3
# New Hampshire     3.3
# New Jersey        5.2
# New York         10.9
# Pennsylvania      6.1
# Rhode Island      2.4
# Vermont           5.5
# Name: Murder, dtype: float64
print(subdf.idxmax())

prints

New York
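
If you don't have rpy2 set up, the same idea can be tried on a small hand-built frame. This is only a sketch: the Northeast values are copied from the output above, while the other rows and the names northeast, top_state and top_row are illustrative assumptions, not part of the original data export.

```python
import pandas as pd

# Small stand-in for the R `state` data. The Northeast values come from the
# output above; the Alabama and Nevada rows are illustrative placeholders.
df = pd.DataFrame(
    {
        "region": ["South", "Northeast", "Northeast", "Northeast", "West"],
        "Murder": [15.1, 10.9, 6.1, 5.5, 11.5],
    },
    index=["Alabama", "New York", "Pennsylvania", "Vermont", "Nevada"],
)

# Filter first, then call idxmax on the filtered Series, so the returned
# label always belongs to the subset.
northeast = df.loc[df["region"] == "Northeast", "Murder"]
top_state = northeast.idxmax()

# The label can then be used to pull the full row from the original frame.
top_row = df.loc[top_state]
print(top_state)
print(top_row)
```

The expression in the question fails because idxmax is evaluated on the unfiltered Murder column, and the resulting label is then looked up in a subset that does not contain it.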

To select the state with the highest murder rate (as of 1976) for each region:

In [24]: df.groupby('region')['Murder'].idxmax()
Out[24]: 
region
North Central    Michigan
Northeast        New York
South             Alabama
West               Nevada
Name: Murder, dtype: object
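
If you want the full row for each region's maximum rather than just the state name, one option is to pass the labels from idxmax back to df.loc. A minimal sketch, assuming a small hand-built frame whose values are illustrative placeholders rather than the real export:

```python
import pandas as pd

# Illustrative stand-in for the state data (values are placeholders).
df = pd.DataFrame(
    {
        "region": ["South", "South", "Northeast", "Northeast", "West", "West"],
        "Murder": [15.1, 10.7, 10.9, 6.1, 11.5, 10.3],
    },
    index=["Alabama", "Florida", "New York", "Pennsylvania", "Nevada", "California"],
)

# idxmax per group returns one index label per region.
top_labels = df.groupby("region")["Murder"].idxmax()

# Passing those labels to .loc returns the complete row for each of them.
top_rows = df.loc[top_labels]
print(top_rows)
```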
Licensed under: CC-BY-SA with attribution