Question

Another pandas question:

I have this table with hierarchical indexing:

In [51]:
from pandas import DataFrame
f = DataFrame({'a': ['1','2','3'], 'b': ['2','3','4']})
f.columns = [['level1 item1', 'level1 item2'],['', 'level2 item2'], ['level3 item1', 'level3 item2']]
f
Out[51]:
    level1 item1    level1 item2
                    level2 item2
    level3 item1    level3 item2
0         1              2
1         2              3
2         3              4

It happens that selecting level1 item1 produces the following error:

In [58]: f['level1 item1']
AssertionError: Index length did not match values

However, this seems to be somewhat related to the number of levels. When I reduce the number of levels to only two, there is no such error:

from pandas import DataFrame
f = DataFrame({'a': ['1','2','3'], 'b': ['2','3','4']})
f.columns = [['level1 item1', 'level1 item2'],['', 'level1 item2']]
f
Out[59]:
     level1 item1   level1 item2
                    level1 item2
0          1              2
1          2              3
2          3              4

Instead, the previous DataFrame gives the expected series:

In [63]:
f['level1 item1']
Out[63]:
0    1
1    2
2    3
Name: level1 item1

Filling the gap below level1 item1 with a dummy character "fixes" this issue but it's not a good solution.

How can I fix this issue without resorting to fill those columns with dummy names?

Thanks a lot!


Original example:

enter image description here

This table was produced using the following indexes:

index = [np.array(['Size and accumulated size of adjusted gross income', 'All returns', 'All returns', 'All returns', 'All returns', 'All returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns']),
np.array(['', 'Number of returns', 'Percent of total', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit', 'Number of returns', 'Percent of total', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit', 'Taxable income', 'Taxable income', 'Taxable income', 'Income tax after credits', 'Income tax after credits', 'Income tax after credits', 'Total income tax', 'Total income tax', 'Total income tax', 'Total income tax', 'Total income tax']),
np.array(['', '', '', '', '', '', '', '','', '', 'Number of returns', 'Amount', 'Percent of total', 'Number of returns', 'Amount', 'Percent of total', 'Amount', 'Percent of', 'Percent of', 'Percent of', 'Average total income tax (dollars)']),
np.array(['', '', '', 'Amount', 'Percent of total', 'Average (dollars)', 'Average (dollars)', 'Average (dollars)', 'Amount', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Total', 'Taxable income', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit'])]

df.columns = index

That's an almost perfect copy of some data in a CSV file but you can see that below "Number of returns", "Percent of total" and "Adjusted gross income less deficit" there is a gap. That gap produces this error when I try to select Number of returns:

In [68]: df['Taxable returns']['Number of returns']
AssertionError: Index length did not match values

I don't understand this error. So a good explanation would be highly appreciated. In any case, when I fill that gap using this index (notice the first elements in the third numpy array):

index = [np.array(['Size and accumulated size of adjusted gross income', 'All returns', 'All returns', 'All returns', 'All returns', 'All returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns', 'Taxable returns']),
np.array(['', 'Number of returns', 'Percent of total', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit', 'Number of returns', 'Percent of total', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit', 'Taxable income', 'Taxable income', 'Taxable income', 'Income tax after credits', 'Income tax after credits', 'Income tax after credits', 'Total income tax', 'Total income tax', 'Total income tax', 'Total income tax', 'Total income tax']),
np.array(['1', '2', '3', '4', '5', '6', '7', '8','9', '10', 'Number of returns', 'Amount', 'Percent of total', 'Number of returns', 'Amount', 'Percent of total', 'Amount', 'Percent of', 'Percent of', 'Percent of', 'Average total income tax (dollars)']),
np.array(['', '', '', 'Amount', 'Percent of total', 'Average (dollars)', 'Average (dollars)', 'Average (dollars)', 'Amount', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Percent of total', 'Total', 'Taxable income', 'Adjusted gross income less deficit', 'Adjusted gross income less deficit'])]

df.columns = index

I get proper results:

In [71]: df['Taxable returns']['Number of returns']
Out[71]:
7
Average (dollars)
0    90,660,104
1    3,495
...
Was it helpful?

Solution

I pushed a fix for this yesterday. Here's the new behavior on github master:

In [1]: paste
from pandas import DataFrame
f = DataFrame({'a': ['1','2','3'], 'b': ['2','3','4']})
f.columns = [['level1 item1', 'level1 item2'],['', 'level2 item2'], ['level3 item1', 'level3 item2']]
f

## -- End pasted text --
Out[1]: 
  level1 item1 level1 item2
               level2 item2
  level3 item1 level3 item2
0            1            2
1            2            3
2            3            4

In [2]: f['level1 item1']
Out[2]: 
  level3 item1
0            1
1            2
2            3
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top