Creating a multi-index from csv census data

https://stackoverflow.com/questions/18926100

29-06-2022

Question

I would like to create a multi indexed dataframe so I can calculate values in a more organized way.

I know a MUCH more elegant solution is out there, but I'm struggling to find it. Most of the stuff I've found involves series and tuples. I'm fairly new to pandas (and programming) and this is my first attempt at using/creating multi-indexes.

After downloading the census data as CSV and creating a DataFrame with the pertinent fields, I have:

county housingunits2010 housingunits2012 occupiedunits2010 occupiedunits2012
8001   120              200              50                100
8002   100              200              75                125

And I want to end up with:

id    Year  housingunits occupiedunits
8001  2010  120          50
      2012  200          100
8002  2010  100          75
      2012  200          125

I would then like to add columns of calculated values (e.g. difference between years, % change) and columns from other dataframes, merging on county and year.

I figured out a workaround with the basic methods that I've learned (see below), but...it certainly isn't elegant. Any suggestion would be appreciated.

First, creating two separate data frames (copies, so that adding a column later does not trigger SettingWithCopyWarning)

df3 = df2[["county_id","housingunits2012"]].copy()
df4 = df2[["county_id","housingunits2010"]].copy()

Adding the year column

df3['year'] = '2012'
df4['year'] = '2010'
df3.columns = ['county_id','housingunits','year']
df4.columns = ['county_id','housingunits','year']

Appending

df5 = pd.concat([df3, df4])

Writing to csv

df5.to_csv('/Users/ntapia/df5.csv', index = False)

Reading & sorting

df6 = pd.read_csv('/Users/ntapia/df5.csv', index_col=[0, 2])
df6 = df6.sort_index()

Result (actual data):

                      housingunits
county_id year              
8001      2010        163229
          2012        163986
8005      2010        238457
          2012        239685
8013      2010        127115
          2012        128106
8031      2010        285859
          2012        288191
8035      2010        107056
          2012        109115
8059      2010        230006
          2012        230850
8123      2010         96406
          2012         97525

Thanks!

Solution

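import re
import pandas as pd
df = df.set_index('county')
# split each column name into a (label, year) tuple, e.g. ('housingunits', '2010')
df = df.rename(columns=lambda x: re.search(r'([a-zA-Z_]+)(\d{4})', x).groups())
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['label', 'year'])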
s = df.unstack()
s.name = 'count'
print(s)

gives

label          year  county
housingunits   2010  8001      120
                     8002      100
               2012  8001      200
                     8002      200
occupiedunits  2010  8001       50
                     8002       75
               2012  8001      100
                     8002      125
Name: count, dtype: int64

If you want that in a DataFrame call reset_index():

print(s.reset_index())

yields

           label  year  county  count
0   housingunits  2010    8001    120
1   housingunits  2010    8002    100
2   housingunits  2012    8001    200
3   housingunits  2012    8002    200
4  occupiedunits  2010    8001     50
5  occupiedunits  2010    8002     75
6  occupiedunits  2012    8001    100
7  occupiedunits  2012    8002    125
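
The long format above puts every value in a single column, whereas the question asks for one row per county and year with a column per label. One way to get there is to stack only the year level of the columns. This is a minimal sketch, assuming the two-county sample frame from the question; the names tidy, change and pct_change are illustrative and not part of the original answer, and the year values stay strings because they come from the regex match.

```python
import re
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "county": [8001, 8002],
    "housingunits2010": [120, 100],
    "housingunits2012": [200, 200],
    "occupiedunits2010": [50, 75],
    "occupiedunits2012": [100, 125],
}).set_index("county")

# Split each column name into a (label, year) pair, as in the answer.
pairs = [re.search(r"([a-zA-Z_]+)(\d{4})", c).groups() for c in df.columns]
df.columns = pd.MultiIndex.from_tuples(pairs, names=["label", "year"])

# Move only the 'year' level into the row index; 'label' stays as columns.
tidy = df.stack("year").sort_index()

# Difference and percent change between years, computed within each county.
grouped = tidy.groupby(level="county")["housingunits"]
tidy["change"] = grouped.diff()
tidy["pct_change"] = grouped.pct_change() * 100
print(tidy)
```

The first year of each county has no earlier value to compare against, so its change and pct_change entries are NaN. Because the result is indexed by (county, year), columns from another frame with the same two index levels could then be added with tidy.join(other).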

Licensed under: CC-BY-SA with attribution
