Question

I'm working on an economics paper and need some help with combining and transforming two datasets.

I have two pandas dataframes, one with a list of countries and their neighbors (borderdf) such as

borderdf
country    neighbor
sweden     norway
sweden     denmark
denmark    germany
denmark    sweden

and one with data (datadf) for each country and year such as

datadf
country    gdp    year
sweden     5454   2004
sweden     5676   2005
norway     3433   2004
norway     3433   2005
denmark    2132   2004
denmark    2342   2005

I need to create a column in datadf, neighborsmeangdp, that contains the mean GDP of all of a country's neighbors, as given by borderdf. I would like my result to look like this:

datadf
country    year    gdp    neighborsmeangdp
sweden     2004    5454   5565
sweden     2005    5676   5775

How should I go about doing this?

Solution

You can merge the two directly with pandas' merge function. The trick is that you want to match the neighbor column of borderdf against the country column of datadf, so that each border pair picks up the neighbor's GDP and year. Then use groupby and mean to get the average neighbor GDP per country and year. Finally, merge back with the data to recover each country's own GDP. For example:

import pandas as pd
from io import StringIO  # Python 3 location; Python 2 used "from StringIO import StringIO"

border_csv = '''
country, neighbor
sweden, norway
sweden, denmark
denmark, germany
denmark, sweden
'''

data_csv = '''
country, gdp, year
sweden, 5454, 2004
sweden, 5676, 2005
norway, 3433, 2004
norway, 3433, 2005
denmark, 2132, 2004
denmark, 2342, 2005
'''

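# skipinitialspace drops the blank after each comma; the leading blank line
# in each string is skipped by default, so the header is the first real line
borders = pd.read_csv(StringIO(border_csv), skipinitialspace=True)
data = pd.read_csv(StringIO(data_csv), skipinitialspace=True)

# Attach each neighbor's gdp and year to the (country, neighbor) pairs
merged = pd.merge(borders, data, left_on='neighbor', right_on='country')
merged = merged.drop('country_y', axis=1)
merged.columns = ['country', 'neighbor', 'gdp', 'year']

# Average the neighbors' gdp per country and year; selecting gdp keeps the
# text column 'neighbor' out of the mean
grouped = merged.groupby(['country', 'year'])
neighbor_means = grouped[['gdp']].mean()
neighbor_means.columns = ['neighbor_gdp']
neighbor_means.reset_index(inplace=True)

# Merge back with the data to recover each country's own gdp
results_df = pd.merge(neighbor_means, data, on=['country', 'year'])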
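
One detail worth checking against the question: the expected figures there (5565 and 5775 for Sweden) equal the sum of Norway's and Denmark's GDP, not their mean. The following is a minimal sketch, not the answerer's code, that computes both statistics in one pass so either can be used. The column names neighbor_gdp, neighbors_mean_gdp and neighbors_sum_gdp are illustrative assumptions.

```python
import pandas as pd

# Sample data from the question
borderdf = pd.DataFrame({
    "country": ["sweden", "sweden", "denmark", "denmark"],
    "neighbor": ["norway", "denmark", "germany", "sweden"],
})
datadf = pd.DataFrame({
    "country": ["sweden", "sweden", "norway", "norway", "denmark", "denmark"],
    "gdp": [5454, 5676, 3433, 3433, 2132, 2342],
    "year": [2004, 2005, 2004, 2005, 2004, 2005],
})

# Rename so the data can be joined on the neighbor column directly
# (neighbor_gdp is a hypothetical column name)
neighbor_data = datadf.rename(columns={"country": "neighbor", "gdp": "neighbor_gdp"})
pairs = borderdf.merge(neighbor_data, on="neighbor")

# Named aggregation: one row per (country, year) with both statistics
neighbor_stats = pairs.groupby(["country", "year"], as_index=False).agg(
    neighbors_mean_gdp=("neighbor_gdp", "mean"),
    neighbors_sum_gdp=("neighbor_gdp", "sum"),
)

# Inner merge: countries with no usable neighbor data (norway here) drop out
result = datadf.merge(neighbor_stats, on=["country", "year"])
print(result)
```

For Sweden in 2004 this gives a sum of 5565 and a mean of 2782.5, so the question's expected column matches the sum.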

Other answers

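I think a direct way is to put the GDP values in the border DataFrame, one column per year. Then all that is needed is to sum the groupby object and do a merge. Note that the sum, not the mean, is what reproduces the figures in the question's expected output (5565 and 5775 for Sweden):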

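# datadf2 is not defined in the original answer; it is assumed here to be
# datadf indexed by (country, year)
datadf2 = datadf.set_index(['country', 'year'])

# One column per year holding each neighbor's gdp (.ix was removed from pandas; use .loc)
borderdf[2004] = [datadf2.loc[(item, 2004), 'gdp'] for item in borderdf.neighbor]
borderdf[2005] = [datadf2.loc[(item, 2005), 'gdp'] for item in borderdf.neighbor]

# Sum the neighbors' gdp per country, then reshape to one row per (year, country)
gpdf = borderdf.groupby(by=['country'])[[2004, 2005]].sum()
df = pd.DataFrame(gpdf.unstack(), columns=['neighborsmeangdp'])
df = df.reset_index()
df = df.rename(columns={'level_0': 'year'})

# pd.ordered_merge was renamed pd.merge_ordered
print(pd.merge_ordered(datadf, df))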
   country   gdp  year  neighborsmeangdp
0  denmark  2132  2004              7586
1  germany  2132  2004               NaN
2   norway  3433  2004               NaN
3   sweden  5454  2004              5565
4  denmark  2342  2005              8018
5  germany  2342  2005               NaN
6   norway  3433  2005               NaN
7   sweden  5676  2005              5775

[8 rows x 4 columns]

Note that I had to make up some data for Germany, since it appears as a neighbor but has no rows in datadf:

germany    2132   2004
germany    2342   2005

In reality, I am sure Germany is doing better.
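
If making up rows is undesirable, a merge-based lookup avoids it, because border pairs whose neighbor has no data simply drop out of the join. The sketch below is one possible variant under that assumption, not the answerer's code. It also avoids hardcoding the years and uses a left merge so that every row of datadf is kept. The helper column neighbor_gdp is a hypothetical name, and neighborsmeangdp holds a sum, as in the answer above.

```python
import pandas as pd

# Sample data from the question, without any invented rows for germany
borderdf = pd.DataFrame({
    "country": ["sweden", "sweden", "denmark", "denmark"],
    "neighbor": ["norway", "denmark", "germany", "sweden"],
})
datadf = pd.DataFrame({
    "country": ["sweden", "sweden", "norway", "norway", "denmark", "denmark"],
    "gdp": [5454, 5676, 3433, 3433, 2132, 2342],
    "year": [2004, 2005, 2004, 2005, 2004, 2005],
})

neighbor_data = datadf.rename(columns={"country": "neighbor", "gdp": "neighbor_gdp"})

# Inner merge: the denmark-germany pair has no matching data and is skipped
pairs = borderdf.merge(neighbor_data, on="neighbor")

# Sum over whichever neighbors do have data, for every year present
neighbor_sums = (
    pairs.groupby(["country", "year"], as_index=False)["neighbor_gdp"]
    .sum()
    .rename(columns={"neighbor_gdp": "neighborsmeangdp"})
)

# Left merge keeps all rows of datadf; countries without listed neighbors get NaN
result = datadf.merge(neighbor_sums, on=["country", "year"], how="left")
print(result)
```

The trade-off is that Denmark's figure then covers only Sweden (5454 in 2004), so a partially observed neighborhood is silently summed over fewer countries. Whether that is acceptable depends on the analysis.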

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow