Subset a dataframe: identify combinations of columns that appear x times?

https://stackoverflow.com/questions/21889521

13-10-2022

Question

I know there are good tools for creative subsetting, but I'm not familiar with them, so your help is very much appreciated. I went through similar questions and couldn't find an answer, but please point me to it if you think this is a duplicate.

Let's assume a df looking like this:

   Pop Loc BP
1    1   a 10
2    2   a 10
3    3   a 10
4    4   a 10
5    3   a 50
6    2   c 21
7    1   d 33
8    2   d  8
9    3   d  8
10   4   d  8

I want to identify which Loc are present in all 4 levels of Pop but considering Loc in combination with BP (i.e. in the above example row 5 and row 3 are different). The desired output should look like this:

   Pop Loc BP
1    1   a 10
2    2   a 10
3    3   a 10
4    4   a 10

In this example only the first 4 rows of df meet the condition, as Loc=="a" and BP==10 exist in Pop 1, 2, 3 and 4.

Row 5 should be excluded because the combination Loc=="a" and BP==50 is only present in Pop 3, and rows 7-10 do not meet the condition because the combination Loc=="d" and BP==8 is not present in Pop 1 (row 7 has BP==33, which appears only in Pop 1).

The solution has to be something general and reasonably efficient, as in the real dataset length(levels) of Loc and BP is around 4,000 (Pop remains small).

I was thinking of using paste() to "merge" Loc and BP into a new column and then keeping only the values that appear the desired number of times (4 in this example). But I'm sure there is a better way.

Thanks

dput() to create df:

df <- structure(list(Pop = c(1, 2, 3, 4, 3, 2, 1, 2, 3, 4), Loc = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L), .Label = c("a", "c", "d"
), class = "factor"), BP = c(10, 10, 10, 10, 50, 21, 33, 8, 8, 
8)), .Names = c("Pop", "Loc", "BP"), row.names = c(NA, -10L), class = "data.frame")

Solution

For example, using plyr you can create a new id for each Loc/BP combination (using interaction), then run the check on each id group and keep only the groups in which all of Pop 1 to 4 occur:

library(plyr)
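# drop = TRUE keeps only the Loc/BP combinations that actually occur,
# which matters when both columns have thousands of levels
ddply(transform(df, id = interaction(Loc, BP, drop = TRUE)), .(id),
      function(x) if (all(1:4 %in% x$Pop)) x)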

  Pop Loc BP   id
1   1   a 10 a.10
2   2   a 10 a.10
3   3   a 10 a.10
4   4   a 10 a.10
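
The paste() idea from the question also works in base R without extra packages. The following is only a sketch under some assumptions: the names key, pairs, counts and keep are made up for illustration, and the "_" separator is assumed not to occur inside Loc or BP values in a way that makes two different combinations collide. It counts the distinct Pop values per Loc/BP key and keeps the keys that occur in every Pop:

```r
# Example data from the question
df <- data.frame(
  Pop = c(1, 2, 3, 4, 3, 2, 1, 2, 3, 4),
  Loc = factor(c("a", "a", "a", "a", "a", "c", "d", "d", "d", "d")),
  BP  = c(10, 10, 10, 10, 50, 21, 33, 8, 8, 8)
)

# One key per Loc/BP combination (assumes "_" is a safe separator)
key <- paste(df$Loc, df$BP, sep = "_")

# Keep each (key, Pop) pair once so repeated rows are not counted twice
pairs <- unique(data.frame(key = key, Pop = df$Pop))

# Number of distinct Pop values seen for each key
counts <- table(pairs$key)

# Keys that occur in every Pop present in the data
keep <- names(counts)[as.vector(counts) == length(unique(df$Pop))]

result <- df[key %in% keep, ]
result
```

Because table() only sees the keys that actually occur, this should avoid building the full Loc x BP grid of levels.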

OTHER TIPS

A very general solution using base R, where you can specify the grouping columns, the column that holds your required values, and the required values themselves:

subsetCustom <- function(
  data,
  INDICES,
  requiredValueCol,
  requiredValues)
{
  subsetData <- by(
    data = data,
    INDICES = INDICES,
    FUN = function(subdata, requiredValueCol, 
                   requiredValues) {
      if (all(requiredValues %in% subdata[, requiredValueCol])) 
        out <- subdata
      else out <- NULL
      return(out)
    },
    requiredValueCol = requiredValueCol,
    requiredValues = requiredValues)

  subsetData <- do.call(rbind, subsetData)
  return(subsetData)    
}

subsetCustom(
  data = df, 
  INDICES = list(df$Loc, df$BP),
  requiredValueCol = "Pop",
  requiredValues = 1:4)
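
If the per-group data frames are not needed, the same "all required values present" check can probably be written more compactly with ave(). This is a rough sketch rather than a drop-in replacement for subsetCustom: the names required, key and has_all are illustrative, and the grouping key is built with interaction(..., drop = TRUE) on the assumption that only combinations present in the data matter:

```r
# Example data from the question
df <- data.frame(
  Pop = c(1, 2, 3, 4, 3, 2, 1, 2, 3, 4),
  Loc = factor(c("a", "a", "a", "a", "a", "c", "d", "d", "d", "d")),
  BP  = c(10, 10, 10, 10, 50, 21, 33, 8, 8, 8)
)

required <- 1:4  # Pop values that every kept group must contain

# Grouping key restricted to the Loc/BP combinations that occur in df
key <- interaction(df$Loc, df$BP, drop = TRUE)

# For every row: 1 if its group contains all required Pop values, else 0
# (ave() stores the logical result in a numeric vector)
has_all <- ave(df$Pop, key, FUN = function(p) all(required %in% p))

result <- df[has_all == 1, ]
result
```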

Licensed under: CC-BY-SA with attribution
