I am attempting to divide one column by another inside of a function:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:
ValueError: putmask: mask and data must be the same size
I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
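For reference, here is a minimal sketch of what I expect the call to do. The frame `toy`, its values, and its index labels are made up for illustration and do not come from the QCEW files; it assumes a reasonably recent pandas rather than the exact version I am running.

```python
import pandas as pd

# Hypothetical toy data: one 'base' column plus two component columns.
toy = pd.DataFrame(
    {
        "base": [100.0, 200.0],
        "fed_wage": [10.0, 50.0],
        "st_wage": [20.0, 30.0],
    },
    index=["a", "b"],
)

# Divide every column by the 'base' column, matching rows on the index.
shares = toy.div(toy["base"], axis="index")
print(shares)
```

On a frame like this, with unique index labels, each column is simply divided row by row by `base`, which is the behaviour I am after.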
A further twist is that I am using this function to loop a data management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets for the 1990-2000 period go off without a hitch, but 2001 throws this error. I am afraid I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?
Any thoughts would be greatly appreciated.
EDIT (2/1/2014): Thanks for taking a look, Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure: every year has the same variable set and dtypes (in addition to a consistent data code structure).
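The kind of structural comparison I have in mind can be sketched roughly as follows. `structure_diff` is a hypothetical helper, not something from pandas, and the two small frames are made up so that the check has something to report; they are not taken from the BLS files.

```python
import pandas as pd


def structure_diff(a, b):
    """Compare two frames: columns unique to each, and shared columns whose dtypes differ."""
    only_a = sorted(set(a.columns) - set(b.columns))
    only_b = sorted(set(b.columns) - set(a.columns))
    shared = [c for c in a.columns if c in b.columns]
    dtype_diff = {
        c: (str(a[c].dtype), str(b[c].dtype))
        for c in shared
        if a[c].dtype != b[c].dtype
    }
    return only_a, only_b, dtype_diff


# Hypothetical stand-ins for two year-specific files.
y_first = pd.DataFrame({"area_fips": [1001, 1003], "own_code": [1, 2]})
y_second = pd.DataFrame({"area_fips": ["01001", "01003"], "own_code": [1, 2]})

only_a, only_b, dtype_diff = structure_diff(y_first, y_second)
print(only_a, only_b, dtype_diff)
```

Empty results for all three return values would correspond to what I mean by "consistent structure" between two years.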
EDIT (2/1/2014): Perhaps it would be useful to share the entire function:
import pandas as pd
from pandas import DataFrame

def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)
    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)
    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])
    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0
    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]
    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',
               'total_annual_wages','own_code']
    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]
    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)
    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})
    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
    #Diag prints
    print f
    print lcontrib_lev.head()
    print lcontrib_lev.describe()
    print '*****************************\n'
    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
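Since the error message talks about sizes, one diagnostic that could be run on `lcontrib_lev` just before the failing line is sketched below. `describe_alignment` is a hypothetical helper of my own naming, the toy frame is made up (with one deliberately repeated index label), and the snippet assumes Python 3 with a recent pandas rather than the 0.13.0 setup above.

```python
import pandas as pd


def describe_alignment(df, col):
    """Collect basic facts that matter when dividing df by df[col] along the index."""
    return {
        "n_rows": len(df),
        "n_divisor": len(df[col]),
        "index_is_unique": bool(df.index.is_unique),
        "n_duplicated_labels": int(df.index.duplicated().sum()),
    }


# Hypothetical toy frame; label "b" is repeated on purpose.
toy_lev = pd.DataFrame(
    {"base": [100.0, 200.0, 300.0], "fed_wage": [10.0, 50.0, 60.0]},
    index=["a", "b", "b"],
)

report = describe_alignment(toy_lev, "base")
print(report)
```

Comparing this report for a year that works against the one for 2001 would at least show whether the row counts or the index labels differ between them.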