pandas column division ValueError (putmask: mask and data must be the same size)

https://stackoverflow.com/questions/21513659

06-10-2022

Question

I am attempting to divide one column by another inside of a function:

lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')

As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:

ValueError: putmask: mask and data must be the same size

I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
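For reference, here is a minimal sketch of what the call is expected to do when the row index is unique. The values and the index labels are made up for illustration; only the column names mirror the ones above.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the question.
df = pd.DataFrame(
    {"base": [10.0, 20.0], "fed_wage": [1.0, 5.0], "st_wage": [2.0, 10.0]},
    index=["a", "b"],
)

# axis='index' aligns the Series with the frame's row labels,
# so each row is divided by that row's own 'base' value.
shares = df.div(df["base"], axis="index")
print(shares)
```

In this toy case the 'base' column becomes 1.0 everywhere and the other columns become shares of the base.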

A further twist is that I am using this function to loop a data management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets for 1990-2000 go off without a hitch, but 2001 throws this error. I am afraid I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?

Any thoughts would be greatly appreciated.

EDIT (2/1/2014): Thanks for taking a look Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure, every year has the same variable set and dtype (in addition to a consistent data code structure).
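One way to check the "same variable set and dtype" claim programmatically is sketched below. The helper name `structure_diff` and the two toy frames are hypothetical stand-ins for two yearly files; nothing here is taken from the actual QCEW data.

```python
import pandas as pd

def structure_diff(left, right):
    """Return columns found in only one frame, plus shared columns whose dtypes differ."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    shared = [c for c in left.columns if c in right.columns]
    dtype_diff = {
        c: (str(left[c].dtype), str(right[c].dtype))
        for c in shared
        if left[c].dtype != right[c].dtype
    }
    return only_left, only_right, dtype_diff

# Hypothetical stand-ins for two yearly files (values are invented).
y2000 = pd.DataFrame({"area_fips": [1000, 1001], "own_code": [1, 2]})
y2001 = pd.DataFrame({"area_fips": ["1000", "US000"], "own_code": [1, 2]})

only_left, only_right, dtype_diff = structure_diff(y2000, y2001)
print(only_left, only_right, dtype_diff)
```

An empty result for all three pieces would support the claim that the years share a structure; a non-empty `dtype_diff` would point at a column worth inspecting.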

EDIT (2/1/2014): Perhaps it would be useful to share the entire function:

import pandas as pd
from pandas import DataFrame

def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)

    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)

    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])

    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0

    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]

    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
              'total_annual_wages','own_code']

    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]

    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']

    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)

    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()

    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})

    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)

    #Diag prints
    print(f)
    print(lcontrib_lev.head())
    print(lcontrib_lev.describe())
    print('*****************************\n')

    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')

    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()

    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])

    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]

Answer

Works OK for me. This is on 0.13.1; IIRC nothing in this particular area changed, but it's possible it was a bug that was fixed.

In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]: 
                  base  fed_wage  st_wage  loc_wage
year fips  cty                                     
2001 1000  1000    NaN       NaN      NaN       NaN
           1000    NaN       NaN      NaN       NaN
     10000 10000   NaN       NaN      NaN       NaN
           10000   NaN       NaN      NaN       NaN
     10001 10001   NaN       NaN      NaN       NaN

[5 rows x 4 columns]

In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]: 
                  base  fed_wage   st_wage  loc_wage
year fips  cty                                      
2001 CS566 CS566     1  0.000000  0.000000  0.000000
     US000 US000     1  0.022673  0.027978  0.073828
     USCMS USCMS     1  0.000000  0.000000  0.000000
     USMSA USMSA     1  0.000000  0.000000  0.000000
     USNMS USNMS     1  0.000000  0.000000  0.000000

[5 rows x 4 columns]
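A side observation on the output above: each index label in the head appears twice. That is what `pd.concat(..., axis='index')` produces when the pieces share an index, because it stacks them row-wise instead of joining them side by side. The sketch below contrasts the two axes on invented data. Whether the duplicated labels have anything to do with the putmask error on 0.13.0 is not established here; treat this as an illustration, not a diagnosis.

```python
import pandas as pd

# Two single-column frames sharing one index (values are invented).
idx = pd.Index(["1000", "10000"], name="fips")
lbase = pd.DataFrame({"base": [100.0, 200.0]}, index=idx)
lfed = pd.DataFrame({"fed_wage": [10.0, 50.0]}, index=idx)

# Row-wise stacking: four rows, every label duplicated, gaps filled with 0.
stacked = pd.concat([lbase, lfed], axis="index").fillna(0)

# Column-wise alignment: two rows, one per label.
side_by_side = pd.concat([lbase, lfed], axis="columns").fillna(0)

# With a unique index, the division yields one share per label.
shares = side_by_side.div(side_by_side["base"], axis="index")

print(stacked)
print(side_by_side)
print(shares)
```

If one row per county was the intent, `axis='columns'` (or a join) gives that shape.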

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow