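For your fifth item, removing duplicate pairs, you can try something like the following, which uses the select method on DataFrame. It returns a DataFrame containing the rows you want to remove. Note what the condition does: a row is selected when its mirror image exists in the frame and that mirror has "Interactor 2" lexicographically greater than "Interactor 1". So, of each mirrored pair, the row that gets removed is the one where "Interactor 1" is the greater value.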
Filter 1
#find duplicate pairs
filter = df.select(lambda x: (
(df['Interactor 2'] > df['Interactor 1']) &
(df['Interactor 2'] == df['Interactor 1'].loc[x]) &
(df['Interactor 1'] == df['Interactor 2'].loc[x])
).any())
#remove duplicate pairs
df.drop(filter.index, inplace=True)
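DataFrame.select was deprecated and has since been removed from newer pandas releases, so the snippet above may not run on a current install. As a rough sketch of the same per-row check without select (the toy data and the helper name has_kept_mirror are made up for illustration, and the index is assumed to be unique), something like this should behave the same way:

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 are mirror images of each other.
df = pd.DataFrame({
    "Interactor 1": ["A", "B", "C", "D"],
    "Interactor 2": ["B", "A", "D", "E"],
})

def has_kept_mirror(label):
    """True if some row mirrors `label` and has Interactor 2 > Interactor 1."""
    a = df.at[label, "Interactor 1"]
    b = df.at[label, "Interactor 2"]
    mirror = (
        (df["Interactor 2"] > df["Interactor 1"])
        & (df["Interactor 2"] == a)
        & (df["Interactor 1"] == b)
    )
    return bool(mirror.any())

# Call the predicate once per index label, as select(crit) did,
# then use the result as a boolean mask.
mask = pd.Series([has_kept_mirror(label) for label in df.index],
                 index=df.index, dtype=bool)
duplicates = df[mask]                 # rows to remove
deduped = df.drop(duplicates.index)   # non-destructive alternative to inplace=True
```

Here only row 1 ("B", "A") is flagged, because its mirror in row 0 has the greater value in "Interactor 2"; rows 0, 2 and 3 are kept.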
You'll get better performance if you iterate over a smaller collection and do less work on each row, so moving the first comparison out of the loop should help:
Filter 2
#find duplicate pairs
filter = (df['Interactor 2'] > df['Interactor 1']).select(lambda x: (
(df['Interactor 2'] == df['Interactor 1'].loc[x]) &
(df['Interactor 1'] == df['Interactor 2'].loc[x])
).any())
#remove duplicate pairs
df.drop(filter.index, inplace=True)
To test, I'm using this data source:
import pandas as pd
url = "http://biodev.extra.cea.fr/interoporc/files/study4932/srcInteractionsUsed.txt"
i1, i2 = 'ProteinAcA', 'ProteinAcB'
df1 = pd.read_table(url) #1470 x 7 rows
df2 = df1.loc[:10].copy(deep=True) #11 x 7 rows
df2[i1] = df1.loc[:10][i2]
df2[i2] = df1.loc[:10][i1]
df2.index = range(1481,1492)
df = pd.concat([df1, df2]) #1481 x 7 rows
filter = df[df[i1] > df[i2]].select(lambda x: (
(df[i2] == df[i1].loc[x]) &
(df[i1] == df[i2].loc[x])).any() )
or
def FilterMirroredDuplicates(dataFrame, col1, col2):
    df = dataFrame[dataFrame[col1] > dataFrame[col2]]
    return df.select(lambda x: ((dataFrame[col2] == df[col1].loc[x]) & (dataFrame[col1] == df[col2].loc[x])).any())
filter = FilterMirroredDuplicates(df, i1, i2) #11 x 7 rows
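The function FilterMirroredDuplicates does the same as the select statement above it. While working on this, I found that Filter 2 above doesn't generate the appropriate set of indices to drop: it calls select on the boolean Series produced by the comparison rather than on the rows of df that satisfy it, so the comparison never narrows the candidates and both rows of a mirrored pair end up selected. Either the statement or the function above should solve your problem.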
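If the URL is unavailable, or select is missing from your pandas version, the same test can be approximated on synthetic data. This is only a sketch: the protein IDs are invented, filter_mirrored_duplicates is a hypothetical select-free rewrite of the function above, and the swapped copy is built with rename instead of column assignment.

```python
import pandas as pd

def filter_mirrored_duplicates(frame, col1, col2):
    """Rows with col1 > col2 whose mirror (columns swapped) is also in frame."""
    # Hoist the cheap comparison: only these rows are checked one by one.
    candidates = frame[frame[col1] > frame[col2]]
    has_mirror = pd.Series(
        [bool(((frame[col2] == a) & (frame[col1] == b)).any())
         for a, b in zip(candidates[col1], candidates[col2])],
        index=candidates.index, dtype=bool,
    )
    return candidates[has_mirror]

i1, i2 = "ProteinAcA", "ProteinAcB"
# Invented stand-in for the downloaded table.
df1 = pd.DataFrame({i1: ["P1", "P3", "P5", "P7"],
                    i2: ["P2", "P4", "P6", "P8"]})
# Swapped copy of the first two rows; concat aligns columns by name.
df2 = df1.iloc[:2].rename(columns={i1: i2, i2: i1})
df = pd.concat([df1, df2], ignore_index=True)   # 6 rows

mirrored = filter_mirrored_duplicates(df, i1, i2)
deduped = df.drop(mirrored.index)
```

With this data the two appended rows (index 4 and 5) are the ones returned, since they are the rows where ProteinAcA is the greater value, and dropping them leaves the original four rows.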
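Keep in mind that using select this way is O(n^2): the lambda runs once per candidate row, and each call compares against every row of the DataFrame. I can't think of a better way to perform this check with select.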
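If the quadratic cost turns out to matter, one direction that might be worth trying is to drop the per-row scan in favour of a set lookup. This is a different technique from the select approach, sketched under the assumptions that the values in both columns are hashable and contain no missing values. The name mirrored_duplicate_mask and the toy data are hypothetical.

```python
import pandas as pd

def mirrored_duplicate_mask(frame, col1, col2):
    """Boolean Series: True where col1 > col2 and the swapped pair also occurs."""
    # Build the set of all (col1, col2) pairs once.
    pairs = set(zip(frame[col1], frame[col2]))
    # Each row then needs only a comparison and a set membership test.
    flags = [a > b and (b, a) in pairs
             for a, b in zip(frame[col1], frame[col2])]
    return pd.Series(flags, index=frame.index, dtype=bool)

toy = pd.DataFrame({
    "Interactor 1": ["A", "B", "C", "D"],
    "Interactor 2": ["B", "A", "D", "E"],
})
mask = mirrored_duplicate_mask(toy, "Interactor 1", "Interactor 2")
deduped = toy[~mask]
```

Building the set and testing membership are both roughly constant time per row on average, so the whole pass should scale about linearly with the number of rows. Like the select version, it only removes mirrored pairs and leaves exact repeats of the same row alone.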