Regex to remove words that contain special characters, along with URLs, in R

Question 1

Here is a single regex that, while horribly ugly, does the job:

gsub('(?:^|(?<=\\s))(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)', '\\1', x, perl=TRUE)
## [1] "Google in the of What c# my Website c++"

This treats any of the characters in [#/:+] as "special". c# and c++ contain such characters too, so they are matched explicitly and kept.

Breaking this down:

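First, the match must start either at the beginning of the text or immediately after a whitespace character: (?:^|(?<=\\s)). The whitespace is checked with a lookbehind, so it is required but not consumed by the match. The two alternatives are wrapped in a non-capturing group, (?:...). This is important because the only capturing group in the expression should be the one that holds c# or c++ (later), so that \\1 refers to it.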
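
To see the anchoring on its own, here is a small sketch that applies just the leading group, followed by a literal c#, to a made-up string. The input and the <X> marker are illustrative assumptions, not taken from the question. Only a c# at the start of the string or right after whitespace should be replaced; the one embedded in abc# should be left alone.

```r
# Hypothetical input: "c#" appears at the start, inside "abc#", and after a space.
demo <- "c# abc# c#"

# Same leading group as in the full regex, followed by a literal "c#".
# perl = TRUE is needed because of the lookbehind.
anchored <- gsub('(?:^|(?<=\\s))c#', '<X>', demo, perl = TRUE)
print(anchored)
## [1] "<X> abc# <X>"
```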

Next, a selection of three choices is given, with | as separators: (?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*). This choice is another non-capturing group.

The first choice (one alternative in the outer group, but itself offering two possibilities) matches c++ or c# and captures the value with (c\\+\\+|c#). Otherwise, a URL may be matched with http://[^\\s]*, or a word containing a special character with [^\\s]*[#/:+]+[^\\s]*. The URL or word with a special character is not captured.
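
As a rough illustration of which alternative fires and what ends up in the capture, the sketch below applies only the middle group to individual tokens. The tokens, the added ^ and $ anchors, and the bracketed replacement are assumptions made for the demo; they are not part of the original expression.

```r
# Hypothetical tokens; none of these come from the original question.
tokens <- c("c++", "c#", "http://example.com", "a#b", "plain")

# The middle group from the full regex, anchored to a whole token with ^ and $.
middle <- '^(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)$'

# Wrap capture group 1 in brackets so that an empty capture is visible.
captured <- sub(middle, '[\\1]', tokens, perl = TRUE)
print(captured)
# Expected elements: "[c++]", "[c#]", "[]", "[]", "plain"
```

The URL and a#b match but leave the capture empty, while plain does not match at all and is returned unchanged.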

Finally, a whitespace character must follow or the match must end at the end of the string, as specified by the final non-capturing group: (?:\\s|$). Unlike the lookbehind at the start, this trailing whitespace is part of the match.

Then the whole expression is replaced by the first capture, which may be empty. If it is nonempty, the capture will contain the string c# or c++.

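You do need perl=TRUE for this expression to be valid: the lookbehind (?<=\\s) is Perl-compatible (PCRE) syntax, which R's default regex engine does not support.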
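
Putting the pieces together, here is a minimal sketch that runs the full expression on a made-up string. x_demo and the second test string are assumptions for illustration and are not the x from the question.

```r
# Hypothetical input, not the x from the original question.
x_demo <- "see http://example.com and tag#one then c++"

pattern <- '(?:^|(?<=\\s))(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)'

# The URL and "tag#one" are removed together with their trailing space;
# "c++" is matched, captured, and put back by \\1.
cleaned <- gsub(pattern, '\\1', x_demo, perl = TRUE)
print(cleaned)
## [1] "see and then c++"

# A kept term followed by a single space: the trailing space is part of the
# match but not of the capture, so it is dropped from the result.
glued <- gsub(pattern, '\\1', "c# my", perl = TRUE)
print(glued)
## [1] "c#my"
```

The second call suggests it is worth checking the spacing around c# and c++ in your own input, since a single space after a kept term is consumed along with it.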

Question 2

How about this? It seems to do the trick. It seemed a bit easier to split up the string first with strsplit. One example below uses grep, and the other gsub, each with a different regular expression. The arguments to grep can also be very useful at times; here, value = TRUE makes it return the matching elements rather than their indices.

> newX <-unlist(strsplit(x, "\\s"))

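One detail of the split is easy to miss: splitting on a single whitespace character produces an empty string wherever two whitespace characters are adjacent. A small sketch with a made-up input (spaced is an assumption, not the original x):

```r
# Hypothetical input with two spaces between "c#" and "my".
spaced <- "What c#  my Website"

# Splitting on a single whitespace character yields an empty string
# for the gap between the two consecutive spaces.
parts <- unlist(strsplit(spaced, "\\s"))
print(parts)
# Expected elements: "What", "c#", "", "my", "Website"

# Pasting the pieces back with single spaces reproduces the double space.
rejoined <- paste(parts, collapse = " ")
```

An empty element like this survives a gsub over the tokens but is filtered out by grep, which may explain a doubled space in one result and not in the other.
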
With grep:

> newX2 <- grep("((^[a-z]{2,3}$)|[A-Z]{1})|(c#|(\\+{2}))", newX, value = TRUE)
> paste(newX2, collapse = " ")
[1] "Google in the of What c# my Website c++"
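
To make the filter's behaviour easier to check, here is a sketch that runs the same pattern over a made-up token vector. The tokens are assumptions for illustration. The pattern keeps a token if it consists entirely of two or three lowercase letters, contains an uppercase letter, or contains c# or ++.

```r
# Hypothetical tokens, including an empty string as produced by a double space.
tokens <- c("Google", "in", "the", "http://example.com", "of",
            "c#", "", "my", "a:b", "c++")

keep_pattern <- "((^[a-z]{2,3}$)|[A-Z]{1})|(c#|(\\+{2}))"

# value = TRUE returns the matching elements themselves instead of their indices.
kept <- grep(keep_pattern, tokens, value = TRUE)
joined <- paste(kept, collapse = " ")
print(joined)
## [1] "Google in the of c# my c++"
```

Because the rules are tied to word length and capitalisation, the filter is sensitive to the input: a longer all-lowercase word such as website would be dropped, and a URL containing an uppercase letter would be kept.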

With gsub. This is actually much easier... the key idea is to determine the pattern of where the punctuation shows up within the characters.

> paste(gsub("[a-z]{2,3}(:|#)|(\\+|//)[a-z{1}]", "", newX), collapse = " ")
[1] "Google in the of What c#  my Website c++"
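
For a closer look at what this gsub pattern removes, the sketch below applies it to a few made-up tokens (assumptions for illustration, not the original x). The pattern is copied as written. Note that [a-z{1}] is a character class that also matches the literal characters {, 1 and }; it was presumably meant as [a-z]{1}, which makes no difference for these tokens.

```r
# Hypothetical tokens; the original x is not shown in this answer.
tokens <- c("Google", "abc:", "c#", "", "x//y", "c++")

# Pattern copied from the gsub call above.
strip_pattern <- "[a-z]{2,3}(:|#)|(\\+|//)[a-z{1}]"

# Only the matched fragment of each token is removed, not the whole token.
stripped <- gsub(strip_pattern, "", tokens)
print(stripped)
# Expected elements: "Google", "", "c#", "", "x", "c++"

joined <- paste(stripped, collapse = " ")
```

Here abc: disappears entirely, but x//y is reduced to x rather than removed. The empty elements also remain in the vector, so the pasted result contains doubled spaces.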