This is much easier in pandas now with drop_duplicates and the keep parameter.
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)
Pregunta
The pandas drop_duplicates
function is great for "uniquifying" a dataframe. I would like to drop all rows which are duplicates across a subset of columns. Is this possible?
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
As an example, I would like to drop rows which match on columns A
and C
so this should drop rows 0 and 1.
Solución
This is much easier in pandas now with drop_duplicates and the keep parameter.
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)
Otros consejos
Just want to add to Ben's answer on drop_duplicates:
keep
: {‘first’, ‘last’, False}, default ‘first’
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
So setting keep
to False will give you desired answer.
DataFrame.drop_duplicates(*args, **kwargs) Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters: subset : column label or sequence of labels, optional Only consider certain columns for identifying duplicates, by default use all of the columns keep : {‘first’, ‘last’, False}, default ‘first’ first : Drop duplicates except for the first occurrence. last : Drop duplicates except for the last occurrence. False : Drop all duplicates. take_last : deprecated inplace : boolean, default False Whether to drop duplicates in place or to return a copy cols : kwargs only argument of subset [deprecated] Returns: deduplicated : DataFrame
If you want result to be stored in another dataset:
df.drop_duplicates(keep=False)
or
df.drop_duplicates(keep=False, inplace=False)
If same dataset needs to be updated:
df.drop_duplicates(keep=False, inplace=True)
Above examples will remove all duplicates and keep one, similar to DISTINCT *
in SQL
use groupby
and filter
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.groupby(["A", "C"]).filter(lambda df:df.shape[0] == 1)
Try these various things
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar","foo"], "B":[0,1,1,1,1], "C":["A","A","B","A","A"]})
>>>df.drop_duplicates( "A" , keep='first')
or
>>>df.drop_duplicates( keep='first')
or
>>>df.drop_duplicates( keep='last')
Actually, drop rows 0 and 1 only requires (any observations containing matched A and C is kept.):
In [335]:
df['AC']=df.A+df.C
In [336]:
print df.drop_duplicates('C', take_last=True) #this dataset is a special case, in general, one may need to first drop_duplicates by 'c' and then by 'a'.
A B C AC
2 foo 1 B fooB
3 bar 1 A barA
[2 rows x 4 columns]
But I suspect what you really want is this (one observation containing matched A and C is kept.):
In [337]:
print df.drop_duplicates('AC')
A B C AC
0 foo 0 A fooA
2 foo 1 B fooB
3 bar 1 A barA
[3 rows x 4 columns]
Now it is much clearer, therefore:
In [352]:
DG=df.groupby(['A', 'C'])
print pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value)==1])
A B C
2 foo 1 B
3 bar 1 A
[2 rows x 3 columns]
You can use duplicated()
to flag all duplicates and filter out flagged rows. If you need to assign columns to new_df
later, make sure to call .copy()
so that you don't get SettingWithCopyWarning
later on.
new_df = df[~df.duplicated(subset=['A', 'C'], keep=False)].copy()
One nice feature of this method is that you can conditionally drop duplicates with it. For example, to drop all duplicated rows only if column A is equal to 'foo'
, you can use the following code.
new_df = df[~( df.duplicated(subset=['A', 'B', 'C'], keep=False) & df['A'].eq('foo') )].copy()
Also, if you don't wish to write out columns by name, you can pass slices of df.columns
to subset=
. This is also true for drop_duplicates()
as well.
# to consider all columns for identifying duplicates
df[~df.duplicated(subset=df.columns, keep=False)].copy()
# the same is true for drop_duplicates
df.drop_duplicates(subset=df.columns, keep=False)
# to consider columns in positions 0 and 2 (i.e. 'A' and 'C') for identifying duplicates
df.drop_duplicates(subset=df.columns[[0, 2]], keep=False)
If you want to check 2 columns with try and except statements, this one can help out.
if "column_2" in df.columns:
try:
df[['column_1', "column_2"]] = df[['header', "column_2"]].drop_duplicates(subset = ["column_2", "column_1"] ,keep="first")
except:
df[["column_2"]] = df[["column_2"]].drop_duplicates(subset="column_2" ,keep="first")
print(f"No column_1 for {path}.")
try:
df[["column_1"]] = df[["column_1"]].drop_duplicates(subset="column_1" ,keep="first")
except:
print(f"No column_1 or column_2 for {path}.")