Here is a single regex that, while horribly ugly, does the job:
gsub('(?:^|(?<=\\s))(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)(?:\\s|$)', '\\1', x, perl=TRUE)
## [1] "Google in the of What c# my Website c++"
This uses the expression [#/:+] as the match for "special characters" other than those present in c# and c++.
Breaking this down:
First, the match must begin either at the start of the text or immediately after whitespace, which is checked by a lookbehind but not actually matched: (?:^|(?<=\\s)). The choice is presented as a non-capturing group with (?:). This is important because the only capture in the expression is reserved for c# and c++ (later).
Next, a selection of three choices is given, with | as separators: (?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*). This choice is another non-capturing group.
The first choice (one alternative, but with two possibilities of its own) matches c++ or c# and captures the value with (c\\+\\+|c#). Otherwise, a URL may be matched with http://[^\\s]* or a word containing a special character with [^\\s]*[#/:+]+[^\\s]*. The URL or word with a special character is not captured.
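Finally, a space must be present or it must be the end of the string, as specified by the final non-capturing group: (?:\\s|$). Unlike the leading lookbehind, this group consumes the whitespace it matches, so that character is part of what gets replaced.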
Then the whole expression is replaced by the first capture, which may be empty. If it is nonempty, the capture will contain the string c# or c++.
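To make the breakdown easier to follow, here is a small sketch that assembles the same pattern from its three parts and applies it to a made-up string. The input x and the variable names start, middle and end are illustrative assumptions, not the data from the question.

```r
# Hypothetical input, not the x from the original question.
x <- "see http://example.com for a/b and c++"

# The three parts described above, kept as separate strings for readability.
start  <- "(?:^|(?<=\\s))"   # start of text, or preceded by whitespace (not consumed)
middle <- "(?:(c\\+\\+|c#)|http://[^\\s]*|[^\\s]*[#/:+]+[^\\s]*)"  # capture c++/c#, else match a URL or special word
end    <- "(?:\\s|$)"        # trailing whitespace (consumed) or end of text

# Pasting the parts together gives the same pattern as the one-liner.
pattern <- paste0(start, middle, end)

# Replace every match by the first capture, which is empty for URLs and special words.
result <- gsub(pattern, "\\1", x, perl = TRUE)
print(result)
## [1] "see for and c++"
```

The URL and a/b are dropped together with the space that follows them, while c++ survives because it is the only part that lands in the capture.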
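You do need perl=TRUE for this expression to be valid, because the lookbehind (?<=\\s) is Perl-style syntax that R's default regex engine does not support.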
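As a rough check of the perl=TRUE requirement, the sketch below tries a tiny lookbehind pattern both ways. The pattern and the test string "b b" are made up for illustration, and the assumption is that the default engine rejects the lookbehind with an error or warning rather than silently ignoring it.

```r
# A minimal pattern using the same kind of lookbehind as the full expression.
pattern <- "(?<=\\s)b"

# With perl = TRUE the lookbehind is understood:
# only the "b" that is preceded by whitespace is removed.
with_perl <- gsub(pattern, "", "b b", perl = TRUE)
print(with_perl)
## [1] "b "

# Without perl = TRUE, the default engine is expected to reject the pattern;
# any warning or error is turned into the string "rejected".
without_perl <- tryCatch(
  { gsub(pattern, "", "b b"); "accepted" },
  warning = function(w) "rejected",
  error   = function(e) "rejected"
)
print(without_perl)
```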