Removing lines with crossed column info from a data frame by comparing two columns

https://stackoverflow.com/questions/13415199

29-11-2021

Question

This is one of my recurring nightmares when I try to merge different gene expression results according to gene-pair conditions. Here is my merged data frame:

knowngene1   Logfold1        Gene1   knowngene2   Logfold2        Gene2
uc001ezv.3  5.1167021111    NA  uc001ezu.1  5.6262305191    FLG
uc001ihe.4  4.1338871783    LOC100216001    uc001ihg.3  3.9475325801    NA
uc001iki.4  9.9902455211    CELF2   uc001ikn.2  9.3321964303    NA
uc001ikk.2  10.3059806111   CELF2   uc001ikn.2  9.3321964303    NA
uc001ikl.4  9.9890468379    CELF2   uc001ikn.2  9.3321964303    NA
uc001ikn.2  9.8293484977    NA  uc001iki.4  9.4401488053    CELF2
uc001ikn.2  9.8293484977    NA  uc001ikk.2  9.2887954663    CELF2
uc001ikn.2  9.8293484977    NA  uc001ikl.4  9.4401488053    CELF2
uc001ikn.2  9.8293484977    NA  uc010qbi.2  8.6399349792    CELF2
uc001ikn.2  9.8293484977    NA  uc010qbj.1  9.2887954663    CELF2
uc001ezu.1  5.6262305191    FLG uc001ezv.3  5.1167021111    NA
uc001ihg.3  3.9475325801    NA  uc001ihe.4  4.1338871783    LOC100216001
uc001iki.4  9.4401488053    CELF2   uc001ikn.2  9.8293484977    NA
uc001ikk.2  9.2887954663    CELF2   uc001ikn.2  9.8293484977    NA
uc001ikl.4  9.4401488053    CELF2   uc001ikn.2  9.8293484977    NA
uc001ikn.2  9.3321964303    NA  uc001iki.4  9.9902455211    CELF2
uc001ikn.2  9.3321964303    NA  uc001ikk.2  10.3059806111   CELF2
uc001ikn.2  9.3321964303    NA  uc001ikl.4  9.9890468379    CELF2
uc001ikn.2  9.3321964303    NA  uc010qbi.2  10.3865530025   CELF2
uc001ikn.2  9.3321964303    NA  uc010qbj.1  10.3072927485   CELF2
uc001iot.1  6.9068905956    NA  uc001iou.2  8.4040043896    VIM
uc001iou.2  10.4420548632   VIM uc001iot.1  5.8235197903    NA
uc001ipd.3  4.4693510978    ST8SIA6 uc001ipf.1  5.1931857169    NA
uc001kgd.3  3.5469561781    NA  uc009xts.3  4.0607448636    IFIT2
uc001kgf.3  3.3975573789    IFIT3   uc001kgd.3  3.2512633588    NA

The point is that I don't want to remove duplicated lines (there are none, of course). I want to remove the lines in which the knowngene identifiers are swapped between knowngene1 and knowngene2. Let me show an example. The first one is the line I want to keep:

uc001ikn.2  9.8293484977    NA  uc001iki.4  9.4401488053    CELF2

To me, these next lines are the same. In fact, the first one is the mirror image of the line I want to keep, apart from its expression values, which are more or less in the same range:

uc001iki.4  9.4401488053    CELF2   uc001ikn.2  9.8293484977    NA
uc001ikn.2  9.3321964303    NA  uc001ikl.4  9.9890468379    CELF2

So the idea is to keep ONLY the first one I see and remove the next ones. Do you have any ideas?

Solution

So you want to remove all rows where uc001ikn.2 appears? If so, I think this will work:

Rgames> foo
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    2    3
[5,]    4    1
[6,]    3   10
[7,]    5   11
[8,]    6   12
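Rgames> # keep the first occurrence in column 1, and drop rows whose column 2 value also appears in column 1
Rgames> foo[!duplicated(foo[,1]) & !(foo[,2] %in% foo[,1]),]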
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    5   11
[5,]    6   12

In your case, you'd operate on the df$knowngene1 and df$knowngene2 columns.
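
If the goal is instead to keep the first row of each mirrored pair, one possible sketch is to build an order-independent key from the two identifier columns and drop later rows with the same key. This is not the approach shown above. It assumes that two rows count as "the same" when they hold the same unordered pair of knowngene identifiers. The names `pair_key` and `dedup` are made up for illustration, and the small `df` reuses a few rows from the question.

```r
# A few rows copied from the question's data frame.
df <- data.frame(
  knowngene1 = c("uc001ikn.2", "uc001iki.4", "uc001ikn.2", "uc001ezv.3", "uc001ezu.1"),
  Logfold1   = c(9.8293484977, 9.4401488053, 9.3321964303, 5.1167021111, 5.6262305191),
  Gene1      = c(NA, "CELF2", NA, NA, "FLG"),
  knowngene2 = c("uc001iki.4", "uc001ikn.2", "uc001ikl.4", "uc001ezu.1", "uc001ezv.3"),
  Logfold2   = c(9.4401488053, 9.8293484977, 9.9890468379, 5.6262305191, 5.1167021111),
  Gene2      = c("CELF2", NA, "CELF2", "FLG", NA),
  stringsAsFactors = FALSE
)

# Order-independent key (hypothetical name): the alphabetically smaller
# identifier always comes first, so A/B and B/A produce the same string.
pair_key <- paste(pmin(df$knowngene1, df$knowngene2),
                  pmax(df$knowngene1, df$knowngene2),
                  sep = "|")

# Keep only the first row seen for each unordered pair.
dedup <- df[!duplicated(pair_key), ]
print(dedup)
```

Under this assumption, the row pairing uc001ikn.2 with uc001ikl.4 is kept, because it is a different identifier pair from uc001ikn.2 with uc001iki.4. Collapsing those as well would need a different key, for example one built from the Gene1 and Gene2 columns, and the question does not fully specify that rule.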

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow