1. A working pure-regex solution (a.k.a. Edit#2)
This task can be done purely with regular expressions (many thanks to @Mike Samuel)
First we build a database of emoticons:
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
+ c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)" ";)" ":-)" ";-)" ":(" ";(" ":-(" ";-(" ":]" ";]" ":-]" ";-]" ":[" ";[" ":-[" ";-[" ":D" ";D" ":-D" ";-D"
## [21] ":o" ";o" ":-o" ";-o" ":O" ";O" ":-O" ";-O" ":P" ";P" ":-P" ";-P" ":p" ";p" ":-p" ";-p"
An exemplary input text:
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
A helper function that escapes some special characters so that they may be used in a regex pattern (using the stringi package):
library(stringi)
escape_regex <- function(r) {
stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
A regular expression to match the emoticons:
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
Now, as @Mike Samuel suggested below, we just match (emoticon)|punctuation
(note that emoticons are in a capturing group) and then replace the matches
with the result of capturing group 1 (so if it's an emoticon, we have replacement=this emoticon, if it's a punctuation char, we have replacement=nothing). This will work because the alternation with |
in ICU Regex (which is the regex engine used in stri_replace_all_regex
) is greedy and left-biased: emoticons will be matched earlier than the punctuation characters.
stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"
BTW, if you want to get rid only of a selected set of characters, put e.g. [.,]
instead of [\\p{P}]
above.
2. Regex solution hint - my first (not wise) attempt (a.k.a. original answer)
My very first idea (left here mainly for "historical reasons") was to approach this problem by using look-aheads and look-behinds, but - as you see - that's far from perfect.
To remove all :
and ;
not followed by )
, (
, D
, X
, 8
, [
, or ]
use negative look-behind:
stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"
Now we can add some old-school emoticons (with noses, e.g. :-)
, ;-D
etc.)
stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"
Now the hyphens removal (negative look behind and look ahead)
stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?!<[:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) and the salesperson said Oh boy!"
and so on. Of course, first you should build your own database of emoticons (to leave as they are) and punctuation marks (to remove). The regex will highly depend on these two sets, so it will be difficult to add new emoticons --- it's definitely not worth applying (and may twist your brain).
3. The second attempt (regex-dumb friendlier, a.k.a. Edit#1)
On the other hand, if you're allergic to complex regexes, try this. This approach has some "didactic benefits" - we have full insight on what's being done in each of the following steps:
- Locate all emoticons within
text
;
- Locate all the punctuation characters within
text
;
- Find positions of the punctuation characters that are not parts of emoticons;
- Remove the characters located in step 3.
An exemplary input text - 1 string only - a generalized case is left as an exercise ;)
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
A helper function that escapes some special characters so that they may be used in a regex:
escape_regex <- function(r) {
library("stringi")
stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
A regular expression to match the emoticons:
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
Locate the start and end positions of all the emoticons (i.e. locate the first OR the second OR ... emoticon):
where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
## start end
## [1,] 1 2
## [2,] 4 5
## [3,] 7 8
## [4,] 10 11
## [5,] 13 14
## [6,] 16 17
## [7,] 23 24
## [8,] 64 65
## [9,] 67 69
Locate all the punctuation chars (Here \\p{P}
is the Unicode character class representing punctuation characters):
where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
## start end
## [1,] 1 1
## [2,] 2 2
## [3,] 4 4
## [4,] 7 7
## [5,] 8 8
## ...
## [26,] 72 72
## [27,] 73 73
## [28,] 99 99
## [29,] 107 107
As some punctuation chars occur within the emoticons, we should not stage them for removal:
which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
any(where_punct[i,1] >= where_emots[,1] &
where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
## start end
## [1,] 27 27
## [2,] 38 38
## [3,] 39 39
## [4,] 40 40
## [5,] 46 46
## [6,] 54 54
## [7,] 58 58
## [8,] 60 60
## [9,] 71 71
## [10,] 72 72
## [11,] 73 73
## [12,] 99 99
## [13,] 107 107
Each punctuation mark surely consists only of 1 character, thus always where_punct[,1]==where_punct[,2]
.
Now the final part. As you see, where_punct[,1]
contains the positions of characters to be removed. IMHO the easiest way to do that (without loops) is by converting a string to UTF-32 (each character == 1 integer), remove undesired elements, and then go back to the textual representation:
text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1] 58 41 32 59 80 32 58 93 32 58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!
And the result is:
stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"
Here you are.