Question

I want to join a specific word with the word that follows it. I have tried a bigram approach, which is too slow, and I also tried gregexpr, but neither gave me a good solution. For example:

library(RWeka)
text <- "This approach isnt good enough."
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BigramTokenizer(text)
[1] "This approach" "approach isnt" "isnt good"     "good enough"

What I really want is isnt_good as a single word in the text, i.e. the word that comes after isnt should be joined to it.

text
"This approach isnt_good enough."

Is there an efficient way to turn such pairs into a single unigram? Thanks.


Solution

To extract all occurrences of the word "isnt" together with the word that follows it, you can do this:

library(stringr)
pattern <- "isnt \\w+"
str_extract_all(text, pattern)

[[1]]
[1] "isnt good"

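It does essentially the same thing as the base R example below, but I find the stringr solution more elegant and readable. Note that gregexpr (rather than regexpr, which stops at the first match) is needed to return every occurrence.

> regmatches(text, gregexpr(pattern, text))
[[1]]
[1] "isnt good"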

Update

To replace each occurrence of "isnt x" with "isnt_x", gsub from base R is all you need.

gsub("isnt (\\w+)", "isnt_\\1", text)
[1] "This approach isnt_good enough."

The pattern uses a capturing group: whatever matches inside the parentheses (here, the word after "isnt") is stored and can be referred to as \\1 in the replacement string. See this page for a good introduction: http://www.regular-expressions.info/brackets.html
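
The same capturing-group idea can be extended to several trigger words at once. The following is only a sketch: join_after is a hypothetical helper name, and it assumes the trigger words contain no regex metacharacters. It captures the trigger word as group 1 and the following word as group 2, and uses \\b so that "isnt" is not matched inside a longer word.

```r
# Hypothetical helper (not part of any package): join each trigger word
# to the word that follows it, using two capturing groups.
join_after <- function(text, words) {
  # Assumption: `words` are plain words without regex metacharacters.
  # \\b anchors the trigger at a word boundary; group 1 is the trigger,
  # group 2 is the following word.
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ") (\\w+)")
  gsub(pattern, "\\1_\\2", text, perl = TRUE)
}

join_after("This approach isnt good enough and wasnt fast.", c("isnt", "wasnt"))
# [1] "This approach isnt_good enough and wasnt_fast."
```

Because gsub is vectorized, text can also be a character vector of many documents, which should avoid the cost of tokenizing everything into bigrams first.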

OTHER TIPS

How about this function?

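joinWords <- function(string, word){
  # Split on the literal "<word> " (fixed = TRUE avoids regex interpretation)
  x <- unlist(strsplit(string, paste0(word, " "), fixed = TRUE))
  # Rejoin with "<word>_"; handles repeated occurrences, and returns the
  # string unchanged when the word is absent
  paste(x, collapse = paste0(word, "_"))
}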

> text <- "This approach isnt good enough."
> joinWords(text, "isnt")
# [1] "This approach isnt_good enough."
> joinWords("This approach might work for you", "might")
# [1] "This approach might_work for you"
Licensed under: CC-BY-SA with attribution