Question

I am trying to figure out the most efficient way to match two vectors of strings against a third string. I want to restrict the second match to within a limited number of words or characters of the first match.

Let's say I have a data frame of names like this:

signers <- data.frame(
    first = 
        c("Benjamin","Thomas","Robert","George","Thomas","Jared","James","John","James","George","George","James","Edmund","George") ,
    last = 
        c( "Franklin","Mifflin","Morris","Clymer","Fitzsimons","Ingersoll","Wilson","Blair","Madison","Washington","Mason","McClurg","Randolph","Wythe")
    )

and I have some text like this:

    text <- 
"A lot of people attended the Constitutional Convention in Philadephia, including Alexander Hamilton, Benjamin Franklin and John Adams.  
Not everyone who attended the convention ended up signing the Constitution, including George Wythe, John F. Mercer and Edmund Jennings Randolph who abstained."

I want to search for each name in the "signers" data frame and flag whether they are in the text or not.

In the case of Benjamin Franklin and George Wythe, the names appear in the text exactly. In the case of Edmund Randolph, one word, or 10 characters, lies between his first and last names.

So I am looking for something like this:

      first       last      inparagraph
1  Benjamin   Franklin      1
2    Thomas    Mifflin
3    Robert     Morris
4    George     Clymer
5    Thomas Fitzsimons
6     Jared  Ingersoll
7     James     Wilson
8      John      Blair
9     James    Madison
10   George Washington
11   George      Mason
12    James    McClurg
13   Edmund   Randolph      1
14   George      Wythe      1

I have thought of using the lapply function to find where the first names are located, but I am unsure how to search within the proximity of where each first name was found.

namesfinds <- lapply( signers$first ,  grep, text )

Solution

Here is an option that allows up to three words or initials between first and last names using regular expressions:

patterns <- paste0("(.*)(", signers$first, "(\\s+[[:alpha:].]+){,3}\\s+", signers$last, ")(.*)")
signers$inparagraph <- ifelse(sapply(patterns, grepl, text), "1", "")

Produces:

      first       last inparagraph
1  Benjamin   Franklin           1
2    Thomas    Mifflin            
3    Robert     Morris            
4    George     Clymer            
5    Thomas Fitzsimons            
6     Jared  Ingersoll            
7     James     Wilson            
8      John      Blair           1
9     James    Madison            
10   George Washington            
11   George      Mason            
12    James    McClurg            
13   Edmund   Randolph           1
14   George      Wythe           1

Note that John Blair matches because I modified text for testing purposes to include him (see data below). If you want to allow fewer words, lower the 3 in {,3}. If you want to actually extract the matched names, you could do:

unname(sapply(patterns, gsub, "\\2", text))[sapply(patterns, grepl, text)]
# [1] "Benjamin Franklin"        "John W. F. Blair"         "Edmund Jennings Randolph"
# [4] "George Wythe"     
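As an alternative to the back-reference approach, here is a minimal sketch that pulls out the matched span with regexpr() and regmatches(). It assumes a small made-up sample (first, last and txt are hypothetical stand-ins, not the full data from the question), drops the surrounding (.*) groups since they are only needed for gsub, and writes the repetition with an explicit lower bound as {0,3}.

```r
# Hypothetical miniature data, not the full table from the question
first <- c("Benjamin", "Thomas", "Edmund")
last  <- c("Franklin", "Mifflin", "Randolph")
txt   <- "Benjamin Franklin and Edmund Jennings Randolph attended."

# First name, up to three intervening words or initials, then last name
pats <- paste0(first, "(\\s+[[:alpha:].]+){0,3}\\s+", last)

# regexpr() locates the first match of each pattern in txt;
# regmatches() returns the matched text, or character(0) if there is none
hits  <- lapply(pats, function(p) regmatches(txt, regexpr(p, txt)))
found <- unlist(hits)
found
```

Names without a match simply drop out of found, so there is no need to filter with a second grepl pass.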

Here is the text I used:

text <- 
  "A lot of people attended the Constitutional Convention in Philadephia, including Alexander Hamilton, Benjamin Franklin and John Adams.  
Not everyone who attended the convention ended up signing the Constitution, including George Wythe, John F. Mercer and Edmund Jennings Randolph who abstained and John W. F. Blair ate cake"
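The question also mentions limiting the gap by characters rather than words. A rough sketch of that variant follows, assuming a hypothetical cap called max_chars and a small made-up sample; it is not part of the answer above. Keep in mind that . matches punctuation too, so a character cap can span a comma between two different people.

```r
# Hypothetical sample data for illustration only
first <- c("Benjamin", "Edmund", "George")
last  <- c("Franklin", "Randolph", "Wythe")
txt   <- "Benjamin Franklin, Edmund Jennings Randolph, and George Washington met Mr. Wythe."

# Assumed cap: at most this many characters between first and last name
max_chars <- 10

# Between 1 and max_chars arbitrary characters separate the two names
pats <- paste0(first, ".{1,", max_chars, "}", last)

# One TRUE/FALSE per name: is the pattern found anywhere in txt?
flag <- vapply(pats, grepl, logical(1), x = txt, USE.NAMES = FALSE)
data.frame(first, last, inparagraph = ifelse(flag, "1", ""))
```

Here " Jennings " is exactly 10 characters, so Edmund Randolph is flagged, while the 20 characters between George and Wythe exceed the cap.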

OTHER TIPS

It may not be pretty, but this seems to work. The trick is to paste ".*" between each first and last name, so the regular expression also catches a middle name. It looks like it works with any name in the example; hopefully it works on all your data.

a <- paste(signers[,1], signers[,2])
pst <- paste(signers$first, ".*", signers$last, sep = "")
gg <- gsub("\\.\\*", " ", names(unlist(sapply(pst, grep, text))))
signers$inparagraph <- ifelse(a %in% gg, "1", "")
signers
##       first       last inparagraph
## 1  Benjamin   Franklin           1
## 2    Thomas    Mifflin           
## 3    Robert     Morris           
## 4    George     Clymer           
## 5    Thomas Fitzsimons           
## 6     Jared  Ingersoll           
## 7     James     Wilson           
## 8      John      Blair           
## 9     James    Madison           
## 10   George Washington           
## 11   George      Mason           
## 12    James    McClurg           
## 13   Edmund   Randolph           1
## 14   George      Wythe           1
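One thing to watch with a ".*" pattern is that it places no limit on the gap, so a first name followed much later by an unrelated occurrence of the last name can also be flagged. The sketch below illustrates this on a made-up sentence (txt is hypothetical) and compares it with a pattern that allows at most three intervening words or initials.

```r
# Hypothetical sentence: "George" and "Mason" refer to different people
txt <- "George Wythe spoke first. Later, Mr. Mason replied."

unbounded <- "George.*Mason"                            # any gap allowed
bounded   <- "George(\\s+[[:alpha:].]+){0,3}\\s+Mason"  # at most 3 words between

unbounded_hit <- grepl(unbounded, txt)  # matches across the whole sentence
bounded_hit   <- grepl(bounded, txt)    # gap is too long, so no match

c(unbounded = unbounded_hit, bounded = bounded_hit)
```

If false positives like this matter for your data, a bounded gap is probably the safer choice.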
Licensed under: CC-BY-SA with attribution