Replacing column values in Pandas
Question
I have a dataframe with three columns, as shown below. It has about 10,000 entries, and some of them are duplicates.
Hospital_ID  District_ID  Employee
Hospital 1   District 19         5
Hospital 1   District 19        10
Hospital 1   District 19         6
Hospital 2   District 10        50
Hospital 2   District 10        51
Now I want to remove the duplicates, but I want to replace the values in my original dataframe with their mean, so that it looks like this:
Hospital 1 District 19 7.0000
Hospital 2 District 10 50.5000
Thanks
Solution
As Emre already said, you can use the groupby function. After that, apply reset_index to move the group keys (Hospital_ID and District_ID) from the MultiIndex back into regular columns:
import pandas as pd

# Sample data from the question
df = pd.DataFrame([['Hospital 1', 'District 19', 5],
                   ['Hospital 1', 'District 19', 10],
                   ['Hospital 1', 'District 19', 6],
                   ['Hospital 2', 'District 10', 50],
                   ['Hospital 2', 'District 10', 51]],
                  columns=['Hospital_ID', 'District_ID', 'Employee'])

# Average Employee per (Hospital_ID, District_ID) pair,
# then move the group keys from the index back into columns
df = df.groupby(['Hospital_ID', 'District_ID']).mean().reset_index()
which gives you:
  Hospital_ID  District_ID  Employee
0  Hospital 1  District 19       7.0
1  Hospital 2  District 10      50.5
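If you would rather follow the question's wording literally (first replace each value in the original dataframe with its group mean, then drop the rows that have become identical), a minimal sketch could look like the following. It assumes the same sample data as above; the names `keys` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Hospital 1", "District 19", 5],
        ["Hospital 1", "District 19", 10],
        ["Hospital 1", "District 19", 6],
        ["Hospital 2", "District 10", 50],
        ["Hospital 2", "District 10", 51],
    ],
    columns=["Hospital_ID", "District_ID", "Employee"],
)

# Illustrative name: the columns that identify a group
keys = ["Hospital_ID", "District_ID"]

# Replace every Employee value with the mean of its group.
# transform keeps the original number of rows (still 5 here).
df["Employee"] = df.groupby(keys)["Employee"].transform("mean")

# Rows within a group are now identical, so dropping duplicates
# leaves exactly one row per group.
result = df.drop_duplicates().reset_index(drop=True)
print(result)
```

For this data the result should match the groupby/mean output above. The intermediate step is mainly useful when you also need the group mean attached to every original row.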
Other tips
What you want to do is called aggregation; deduplication (duplicate removal) is something else. I think the code is self-explanatory:
df.groupby(['Hospital_ID', 'District_ID']).mean()
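To make the difference between the two operations concrete, here is a small sketch on the question's sample data. The variable names (`keys`, `deduplicated`, `aggregated`) are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Hospital 1", "District 19", 5],
        ["Hospital 1", "District 19", 10],
        ["Hospital 1", "District 19", 6],
        ["Hospital 2", "District 10", 50],
        ["Hospital 2", "District 10", 51],
    ],
    columns=["Hospital_ID", "District_ID", "Employee"],
)

# Illustrative name: the columns that identify a group
keys = ["Hospital_ID", "District_ID"]

# Deduplication: keep the first row per key pair and discard the rest.
# The Employee values of the discarded rows are simply lost.
deduplicated = df.drop_duplicates(subset=keys)

# Aggregation: combine all rows of a key pair into one summary value.
# as_index=False keeps the keys as ordinary columns.
aggregated = df.groupby(keys, as_index=False)["Employee"].mean()

print(deduplicated)
print(aggregated)
```

Deduplication keeps whichever Employee value happens to come first in each group, while aggregation uses every value in the group, which is what the question asks for.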
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange