Question

Three columns of my data.frame contain subjects. I want to subset this data.frame for different subjects. E.g. if I want to have a data.frame with the subject "apple", the row should be selected if the word "apple" appears in one of the three columns.

doc    <- c("blabla1", "blabla2", "blabla3", "blabla4")
subj.1 <- c("apple", "prune", "coconut", "berry")
subj.2 <- c("coconut", "apple", "cherry", "banana and prune")
subj.3 <- c("berry", "banana", "apple and berry", "pear")
subjects <- c("apple", "prune", "coconut", "berry", "cherry", "pear", "banana")

mydf <- data.frame(doc, subj.1, subj.2, subj.3, stringsAsFactors=FALSE) 
mydf

#       doc   subj.1            subj.2             subj.3
# 1 blabla1    apple           coconut              berry
# 2 blabla2    prune             apple             banana
# 3 blabla3  coconut            cherry    apple and berry
# 4 blabla4    berry  banana and prune               pear

The output for the subject "apple" should look like this:

#       doc   subj.1            subj.2             subj.3
# 1 blabla1    apple           coconut              berry
# 2 blabla2    prune             apple             banana
# 3 blabla3  coconut            cherry    apple and berry

EDIT1: In addition, let's say I have about 200 different subjects and therefore want 200 different data.frames. How could I do that?

I tried a loop approach:

mylist <- vector('list', length(subjects))

for(i in 1:length(subjects)) {
  pattern <- subjects[i]
  # a row is kept if the pattern matches in any of the three subject columns
  filter <- grepl(pattern, ignore.case=T, mydf$subj.1) |
            grepl(pattern, ignore.case=T, mydf$subj.2) |
            grepl(pattern, ignore.case=T, mydf$subj.3)
  subDF <- mydf[filter,]

  mylist[[i]] <- subDF
}

but on my original data.frame I get this error:

Error in grepl(pattern, ignore.case = T, panel$SUBJECT.1) : 
 invalid regular expression 'C++ PROGRAMMING', reason 'Invalid use of repetition operators'

EDIT2: Oh, I see: in the original data.frame, one of the subjects is "C++ PROGRAMMING". Might that "++" cause the error?


Solution

You can use the grepl function:

pattern <- 'apple'
filter <- grepl(pattern, ignore.case=T, mydf$subj.1) | 
          grepl(pattern, ignore.case=T, mydf$subj.2) | 
          grepl(pattern, ignore.case=T, mydf$subj.3)
subDF <- mydf[filter,] 

> subDF 
      doc  subj.1  subj.2          subj.3
1 blabla1   apple coconut           berry
2 blabla2   prune   apple          banana
3 blabla3 coconut  cherry apple and berry
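
If more subject columns are added later, the three grepl calls can be folded into one expression. This is a minimal sketch, assuming the subject columns are the ones whose names start with "subj."; subj_cols and hits are names introduced here for illustration only.

```r
mydf <- data.frame(
  doc    = c("blabla1", "blabla2", "blabla3", "blabla4"),
  subj.1 = c("apple", "prune", "coconut", "berry"),
  subj.2 = c("coconut", "apple", "cherry", "banana and prune"),
  subj.3 = c("berry", "banana", "apple and berry", "pear"),
  stringsAsFactors = FALSE
)

pattern <- "apple"

# Columns to search; assumed here to be those named subj.1, subj.2, ...
subj_cols <- grep("^subj\\.", names(mydf), value = TRUE)

# One logical vector per column, then combined element-wise with OR
hits <- lapply(mydf[subj_cols], function(col) {
  grepl(pattern, col, ignore.case = TRUE)
})
filter <- Reduce(`|`, hits)

subDF <- mydf[filter, ]
```

The result should be the same three rows as above; the only difference is that the column list is computed rather than typed out.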

EDIT:

About your question on the for-loop: I don't see any problem in using it, and I doubt that an apply-family function would give much benefit in terms of execution time.

As for the error: the pattern passed to grepl has to be a valid regular expression, but '+' is a repetition operator (a special character), so '++' is rejected as an invalid use of it.
If you just want to check whether the subject string is contained in the column, you can disable the regular expression engine by setting the grepl argument fixed=TRUE (this means the pattern is a string to be matched as is).

The only drawback is that ignore.case cannot be used with fixed = TRUE, so the match becomes case-sensitive.
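
One possible workaround for the case-sensitivity limitation is to lower-case both the pattern and the column before matching with fixed = TRUE. The sketch below does that for every subject and collects the results in a named list. match_subject is a hypothetical helper, and the fifth row and the "C++ PROGRAMMING" subject are assumptions added only to mirror the error described in the question.

```r
# Example data from the question, plus one hypothetical row containing "C++"
mydf <- data.frame(
  doc    = c("blabla1", "blabla2", "blabla3", "blabla4", "blabla5"),
  subj.1 = c("apple", "prune", "coconut", "berry", "C++ Programming"),
  subj.2 = c("coconut", "apple", "cherry", "banana and prune", "pear"),
  subj.3 = c("berry", "banana", "apple and berry", "pear", "cherry"),
  stringsAsFactors = FALSE
)
subjects <- c("apple", "prune", "coconut", "berry", "cherry", "pear",
              "banana", "C++ PROGRAMMING")

# Hypothetical helper: literal (non-regex), case-insensitive match
# across the given subject columns
match_subject <- function(df, subject, cols = c("subj.1", "subj.2", "subj.3")) {
  hits <- lapply(df[cols], function(col) {
    # lower-case both sides so that fixed = TRUE still ignores case
    grepl(tolower(subject), tolower(col), fixed = TRUE)
  })
  df[Reduce(`|`, hits), , drop = FALSE]
}

# One data.frame per subject, named by subject
mylist <- lapply(subjects, function(s) match_subject(mydf, s))
names(mylist) <- subjects

mylist[["C++ PROGRAMMING"]]
```

Naming the list elements by subject lets you pull out a single data.frame with mylist[["apple"]] instead of tracking the index i.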

Licensed under: CC-BY-SA with attribution