Question

I've got two DataFrames, as follows:

Interactor 1    Interactor 2   Interaction Type
Q99459          Q14204         MI:0914(association)
Q96G01          Q14203         MI:0914(association)
P01106          Q9H0S4         MI:0914(association)
Q9HAU4          P0CG47         MI:0414(enzymatic reaction)
O95786          Q14790         MI:0915(physical association)
... (90000 rows)

and

Gene    UniProt ID
ABI1    Q8IZP0
ABL1    P00519
AKT1    P31749
AP2A1   O95782
AP2B1   P63010
... (244 rows)

What I want to do is this:

  1. Remove any rows where the Interaction Type column in df1 doesn't contain any of a set of partial strings
  2. Remove any rows where Interactor 1 is the same as Interactor 2
  3. Remove any rows where either of Interactor 1 or Interactor 2 ISN'T in df2's UniProt ID column
  4. Remove any duplicated rows
  5. (Problem here) Remove the rows where a mirrored interaction is found, but keep one

The last one is the crux of the problem, really; I'll try to explain what I mean by it. An interaction pair is the two Interactor columns. Removing duplicate interaction pairs (4) is easy, but not the mirrored versions. For example, a mirrored interaction pair would look like this:

Interactor 1     Interactor 2
Q123             Q456
Q456             Q123

These, I don't want. Or, rather, I want just ONE of them, but it doesn't matter which. How would I do this? I've got the following code, which does points (1) through (4) easily enough, but I can't figure out how to do (5)...

import pandas as pd

# Read data
input_file = 'Interaction lists/PrimesDB PPI.xlsx'
data = pd.read_excel(input_file, sheet_name='Sheet 1')
data = data[['Interactor 1', 'Interactor 2', 'Interaction Type']]

# Filter: interaction types
data = data[data['Interaction Type'].str.contains(
    'MI:0407|MI:0915|MI:0203|MI:0217')]

# Filter: self-interactions
data = data[data['Interactor 1'] != data['Interactor 2']]

# Filter: included genes
genes = pd.read_excel('Interaction lists/PrimesDB PPI (filtered).xlsx',
    sheet_name='Gene list')
data = data[data['Interactor 1'].isin(genes['UniProt ID'])]
data = data[data['Interactor 2'].isin(genes['UniProt ID'])]

# Filter: unique interactions
unique = data.drop_duplicates(subset=['Interactor 1', 'Interactor 2'])

Solution

For your fifth item, removing mirrored duplicate pairs, you can try something like the following using the select method on DataFrame. It returns a DataFrame containing the rows you want to remove (assuming the duplicates you want to remove are the rows where Interactor 1 is lexicographically greater than Interactor 2).

Filter 1

#find duplicate pairs
filter = df.select(lambda x: (
                       (df['Interactor 2'] > df['Interactor 1']) &
                       (df['Interactor 2'] == df['Interactor 1'].loc[x]) & 
                       (df['Interactor 1'] == df['Interactor 2'].loc[x])
                   ).any())
#remove duplicate pairs
df.drop(filter.index, inplace=True)
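
As a quick sanity check (a toy example of my own, not from the question's data; it assumes a pandas version that still ships DataFrame.select, i.e. before 1.0), Filter 1 drops the member of a mirrored pair whose Interactor 1 sorts after its Interactor 2:

import pandas as pd  # pandas < 1.0: DataFrame.select was removed in 1.0

df = pd.DataFrame({'Interactor 1': ['Q123', 'Q456', 'Q111'],
                   'Interactor 2': ['Q456', 'Q123', 'Q222']})

filter = df.select(lambda x: (
    (df['Interactor 2'] > df['Interactor 1']) &
    (df['Interactor 2'] == df['Interactor 1'].loc[x]) &
    (df['Interactor 1'] == df['Interactor 2'].loc[x])
).any())

df.drop(filter.index, inplace=True)
# df now holds ('Q123', 'Q456') and ('Q111', 'Q222');
# the mirror ('Q456', 'Q123') was dropped.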

You'll get better performance if you iterate over a smaller collection and perform less work on each row, so moving the first comparison out of the loop should improve performance:

Filter 2

#find duplicate pairs
filter = (df['Interactor 2'] > df['Interactor 1']).select(lambda x: (
                       (df['Interactor 2'] == df['Interactor 1'].loc[x]) & 
                       (df['Interactor 1'] == df['Interactor 2'].loc[x])
                   ).any())
#remove duplicate pairs
df.drop(filter.index, inplace=True)

To test, I'm using this data source:

import pandas as pd
url = "http://biodev.extra.cea.fr/interoporc/files/study4932/srcInteractionsUsed.txt"
i1, i2 = 'ProteinAcA', 'ProteinAcB'
df1 = pd.read_table(url)             # 1470 rows x 7 cols
df2 = df1.iloc[:11].copy(deep=True)  # first 11 rows
df2[i1] = df1.iloc[:11][i2]          # swap the interactor columns to
df2[i2] = df1.iloc[:11][i1]          # create mirrored duplicates
df2.index = range(1481, 1492)        # give the mirrored rows fresh indices
df = pd.concat([df1, df2])           # 1481 rows x 7 cols
filter = df[df[i1] > df[i2]].select(lambda x: (
                           (df[i2] == df[i1].loc[x]) & 
                           (df[i1] == df[i2].loc[x])).any() )

or

def FilterMirroredDuplicates(dataFrame, col1, col2):
    df = dataFrame[dataFrame[col1] > dataFrame[col2]]
    return df.select(lambda x: ((dataFrame[col2] == df[col1].loc[x]) & (dataFrame[col1] == df[col2].loc[x])).any())

filter = FilterMirroredDuplicates(df, i1, i2)  # 11 x 7 rows

The function FilterMirroredDuplicates does the same as the select statement above it. While working on this, I found that Filter 2 above doesn't generate the appropriate set of indices to drop, so use either the statement or the function above; either should solve your problem.

Keep in mind that using select is O(n^2); I can't think of a better way to perform this check with select itself, but a vectorized alternative is sketched below. (Note also that DataFrame.select was deprecated in pandas 0.21 and removed in 1.0, so it requires an older pandas.)
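
For reference, here's a vectorized alternative (my addition, not part of the original answer; it only assumes the question's column names): put each pair into a canonical order with the lexicographically smaller ID first, and a plain duplicated() check then treats mirrored rows as ordinary duplicates. This is roughly O(n) and runs on current pandas:

import numpy as np
import pandas as pd

# Toy frame with one mirrored pair and one exact duplicate.
df = pd.DataFrame({'Interactor 1': ['Q123', 'Q456', 'Q123', 'Q111'],
                   'Interactor 2': ['Q456', 'Q123', 'Q456', 'Q222']})

# Put each pair in canonical order: smaller ID first.
lo = np.minimum(df['Interactor 1'], df['Interactor 2'])
hi = np.maximum(df['Interactor 1'], df['Interactor 2'])

# Rows whose canonical pair has been seen before are duplicates
# (mirrored or exact); keep the first occurrence of each pair.
dupes = pd.DataFrame({'lo': lo, 'hi': hi}).duplicated()
df = df[~dupes]
# df keeps ('Q123', 'Q456') once and ('Q111', 'Q222').

A nice side effect is that the canonical lo/hi columns also handle step (4): exact duplicates and mirrored duplicates are removed in the same pass.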

Licensed under: CC-BY-SA with attribution