Question

I have a dataframe which has three columns as shown below. There are about 10,000 entries in the dataframe and there are duplicates as well.

Hospital_ID   District_ID  Employee
Hospital 1    District 19   5 
Hospital 1    District 19   10
Hospital 1    District 19   6
Hospital 2    District 10   50
Hospital 2    District 10   51

Now I want to remove the duplicates, replacing the Employee values of each duplicated group with their mean, so that the result looks like this:

Hospital 1    District 19   7.0000
Hospital 2    District 10   50.5000

Thanks

Solution

As Emre already said, you can use the groupby function. Grouping by two columns puts both keys into a MultiIndex, so afterwards you should call reset_index to turn them back into ordinary columns:

import pandas as pd

df = pd.DataFrame( [ ['Hospital 1', 'District 19', 5],
                     ['Hospital 1', 'District 19', 10],
                     ['Hospital 1', 'District 19', 6],
                     ['Hospital 2', 'District 10', 50],
                     ['Hospital 2', 'District 10', 51]], columns = ['Hospital_ID', 'District_ID', 'Employee'] )

df = df.groupby( ['Hospital_ID', 'District_ID'] ).mean().reset_index()

which gives you

  Hospital_ID  District_ID  Employee
0  Hospital 1  District 19       7.0
1  Hospital 2  District 10      50.5
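
As a possible alternative, here is a minimal sketch that passes as_index=False to groupby so the keys stay as columns and no separate reset_index call is needed. The variable name result and the explicit selection of the Employee column are my own choices for illustration, not part of the original answer.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(
    [
        ["Hospital 1", "District 19", 5],
        ["Hospital 1", "District 19", 10],
        ["Hospital 1", "District 19", 6],
        ["Hospital 2", "District 10", 50],
        ["Hospital 2", "District 10", 51],
    ],
    columns=["Hospital_ID", "District_ID", "Employee"],
)

# as_index=False keeps the grouping keys as regular columns,
# so the result is already a flat DataFrame.
result = df.groupby(["Hospital_ID", "District_ID"], as_index=False)["Employee"].mean()
print(result)
```

Both variants should give the same flat table; which one reads better is mostly a matter of taste.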

Other tips

What you want to do is called aggregation; deduplication (duplicate removal) is something else. I think the code is self-explanatory:

df.groupby(['Hospital_ID', 'District_ID']).mean()
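
To make the distinction concrete, here is a small sketch on the question's sample data that contrasts the two operations. The names deduped and aggregated are hypothetical, chosen only for this example. Deduplication with drop_duplicates keeps one existing row per key and discards the rest, while aggregation combines all rows of a group into a new value.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Hospital 1", "District 19", 5],
        ["Hospital 1", "District 19", 10],
        ["Hospital 1", "District 19", 6],
        ["Hospital 2", "District 10", 50],
        ["Hospital 2", "District 10", 51],
    ],
    columns=["Hospital_ID", "District_ID", "Employee"],
)

keys = ["Hospital_ID", "District_ID"]

# Deduplication: keep the first row for each key pair, drop the others.
deduped = df.drop_duplicates(subset=keys)

# Aggregation: collapse each group into the mean of its Employee values.
aggregated = df.groupby(keys)["Employee"].mean()

print(deduped)
print(aggregated)
```

Here the deduplicated frame still holds the original values of whichever row came first, whereas the aggregated result holds the group means asked for in the question.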

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange