Question

I've got two DataFrames, as follows:

Interactor 1    Interactor 2   Interaction Type
Q99459          Q14204         MI:0914(association)
Q96G01          Q14203         MI:0914(association)
P01106          Q9H0S4         MI:0914(association)
Q9HAU4          P0CG47         MI:0414(enzymatic reaction)
O95786          Q14790         MI:0915(physical association)
... (90000 rows)

and

Gene    UniProt ID
ABI1    Q8IZP0
ABL1    P00519
AKT1    P31749
AP2A1   O95782
AP2B1   P63010
... (244 rows)

What I want to do is this:

  1. Remove any rows where the Interaction Type column in df1 doesn't contain any of a set of partial strings
  2. Remove any rows where Interactor 1 is the same as Interactor 2
  3. Remove any rows where either of Interactor 1 or Interactor 2 ISN'T in df2's UniProt ID column
  4. Remove any duplicated rows
  5. (Problem here) Remove the rows where a mirrored interaction is found, but keep one

The last one is the crux of the problem, really; I'll try to explain what I mean by it. An interaction pair is the two Interactor columns. Removing duplicate interaction pairs (4) is easy, but not the mirrored versions. For example, a mirrored interaction pair would look like this:

Interactor 1     Interactor 2
Q123             Q456
Q456             Q123

These, I don't want. Or, rather, I want just ONE of them, but it doesn't matter which. How would I do this? I've got the following code, which does points (1) through (4) easily enough, but I can't figure out how to do (5)...

import pandas as pd

# Read data
input_file = 'Interaction lists/PrimesDB PPI.xlsx'
data = pd.read_excel(input_file, sheet_name='Sheet 1')
data = data[['Interactor 1', 'Interactor 2', 'Interaction Type']]

# Filter: interaction types
data = data[data['Interaction Type'].str.contains(
    'MI:0407|MI:0915|MI:0203|MI:0217')]

# Filter: self-interactions
data = data[data['Interactor 1'] != data['Interactor 2']]

# Filter: included genes
genes = pd.read_excel('Interaction lists/PrimesDB PPI (filtered).xlsx',
    sheet_name='Gene list')
data = data[data['Interactor 1'].isin(genes['UniProt ID'])]
data = data[data['Interactor 2'].isin(genes['UniProt ID'])]

# Filter: unique interactions
unique = data.drop_duplicates(subset=['Interactor 1', 'Interactor 2'])

Solution

For your fifth item, removing mirrored duplicate pairs, you can try something like the following using the select method on DataFrame. It returns a DataFrame containing the rows you want to remove (assuming the duplicates you want to remove are the rows where Interactor 1 is lexicographically greater than Interactor 2).

Filter 1

#find duplicate pairs
filter = df.select(lambda x: (
                       (df['Interactor 2'] > df['Interactor 1']) &
                       (df['Interactor 2'] == df['Interactor 1'].loc[x]) & 
                       (df['Interactor 1'] == df['Interactor 2'].loc[x])
                   ).any())
#remove duplicate pairs
df.drop(filter.index, inplace=True)
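
As a quick sanity check (a toy example of my own, not from the question's data; it assumes a pandas version that still ships DataFrame.select, i.e. before 1.0), Filter 1 drops the member of a mirrored pair whose Interactor 1 sorts after its Interactor 2:

import pandas as pd  # pandas < 1.0: DataFrame.select was removed in 1.0

df = pd.DataFrame({'Interactor 1': ['Q123', 'Q456', 'Q111'],
                   'Interactor 2': ['Q456', 'Q123', 'Q222']})

filter = df.select(lambda x: (
    (df['Interactor 2'] > df['Interactor 1']) &
    (df['Interactor 2'] == df['Interactor 1'].loc[x]) &
    (df['Interactor 1'] == df['Interactor 2'].loc[x])
).any())

df.drop(filter.index, inplace=True)
# df now holds ('Q123', 'Q456') and ('Q111', 'Q222');
# the mirror ('Q456', 'Q123') was dropped.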

You'll get better performance if you iterate over a smaller collection and perform less work on each row, so moving the first comparison out of the loop should improve performance:

Filter 2

#find duplicate pairs
filter = (df['Interactor 2'] > df['Interactor 1']).select(lambda x: (
                       (df['Interactor 2'] == df['Interactor 1'].loc[x]) & 
                       (df['Interactor 1'] == df['Interactor 2'].loc[x])
                   ).any())
#remove duplicate pairs
df.drop(filter.index, inplace=True)

To test, I'm using this data source:

import pandas as pd
url = "http://biodev.extra.cea.fr/interoporc/files/study4932/srcInteractionsUsed.txt"
i1, i2 = 'ProteinAcA', 'ProteinAcB'
df1 = pd.read_table(url)             # 1470 rows x 7 cols
df2 = df1.iloc[:11].copy(deep=True)  # first 11 rows
df2[i1] = df1.iloc[:11][i2]          # swap the interactor columns to
df2[i2] = df1.iloc[:11][i1]          # create mirrored duplicates
df2.index = range(1481, 1492)        # give the mirrored rows fresh indices
df = pd.concat([df1, df2])           # 1481 rows x 7 cols
filter = df[df[i1] > df[i2]].select(lambda x: (
                           (df[i2] == df[i1].loc[x]) & 
                           (df[i1] == df[i2].loc[x])).any() )

or

def FilterMirroredDuplicates(dataFrame, col1, col2):
    df = dataFrame[dataFrame[col1] > dataFrame[col2]]
    return df.select(lambda x: ((dataFrame[col2] == df[col1].loc[x]) & (dataFrame[col1] == df[col2].loc[x])).any())

filter = FilterMirroredDuplicates(df, i1, i2)  # 11 x 7 rows

The function FilterMirroredDuplicates does the same as the select statement above it. While working on this, I found that Filter 2 above doesn't generate the appropriate set of indices to drop, so use either the statement or the function above; either should solve your problem.

Keep in mind that using select is O(n^2); I can't think of a better way to perform this check with select itself, but a vectorized alternative is sketched below. (Note also that DataFrame.select was deprecated in pandas 0.21 and removed in 1.0, so it requires an older pandas.)
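
For reference, here's a vectorized alternative (my addition, not part of the original answer; it only assumes the question's column names): put each pair into a canonical order with the lexicographically smaller ID first, and a plain duplicated() check then treats mirrored rows as ordinary duplicates. This is roughly O(n) and runs on current pandas:

import numpy as np
import pandas as pd

# Toy frame with one mirrored pair and one exact duplicate.
df = pd.DataFrame({'Interactor 1': ['Q123', 'Q456', 'Q123', 'Q111'],
                   'Interactor 2': ['Q456', 'Q123', 'Q456', 'Q222']})

# Put each pair in canonical order: smaller ID first.
lo = np.minimum(df['Interactor 1'], df['Interactor 2'])
hi = np.maximum(df['Interactor 1'], df['Interactor 2'])

# Rows whose canonical pair has been seen before are duplicates
# (mirrored or exact); keep the first occurrence of each pair.
dupes = pd.DataFrame({'lo': lo, 'hi': hi}).duplicated()
df = df[~dupes]
# df keeps ('Q123', 'Q456') once and ('Q111', 'Q222').

A nice side effect is that the canonical lo/hi columns also handle step (4): exact duplicates and mirrored duplicates are removed in the same pass.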

Licensed under: CC-BY-SA with attribution