Question

I am trying to write a regular expression that excludes certain strings from its match. It should remove every number-only and alphanumeric string and every punctuation mark, but keep certain meaningful strings (911, K-12, K9, E-COMMERCE, etc.).

I figured I need a negative lookahead that specifies what must be skipped. The pattern works almost as needed, but it fails on a few inputs. Below are the code and the results of the replacement; where the result is wrong, I have noted what it should be. The cases I can't figure out are strings that combine punctuation, numbers and letters. Any help is greatly appreciated. Thanks.

blah <- c('ASDF911 2346', 'E-COMMERCE', 'AMAZON E-COMMERCE', 'K-12 89752 911', '65426 -', 'TEACHERK-12', 'K9 OFFICER', 'WORK - K-9564', 'DEVELOPER C++', ' C+ C +5', 'DEFAULT - 456')
gsub('(^| )(?!(911|E[-]COMMERCE|K[-]12|C[+]{1,2}))([[:punct:]]|[0-9]+|([0-9]+[A-Z]+|[A-Z]+[0-9]+)[0-9A-Z]*)', ' ', blah, perl = TRUE)

" "                     # OK
"E-COMMERCE"            # OK
"AMAZON E-COMMERCE"     # OK
"K-12  911"             # OK
"  "                    # OK
"TEACHERK-12"           # this should be "  "
"K9 OFFICER"            # OK
"WORK K-9564"           # this should be "WORK   "
"DEVELOPER C++"         # OK
" C+ C 5"               # this should be " C+ C "
"DEFAULT  "             # OK

Solution

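It is easier to match both the whitelisted keywords and the unwanted tokens, capture only the keywords, and replace every match with the capture group. An unwanted token leaves the group empty, so it is deleted: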

gsub('(?:\\b(911\\b|E-COMMERCE\\b|K-12\\b|C\\b[+]{0,2})|[[:punct:]]|[A-Z-]*[0-9][A-Z0-9-]*)', '\\1', blah, perl = TRUE)

Output:

" "
"E-COMMERCE"
"AMAZON E-COMMERCE"
"K-12  911"
" "
""
" OFFICER"   # Should this really be "K9 OFFICER"?
"WORK  "
"DEVELOPER C++"
" C+ C "
"DEFAULT  "
  • \b is a word boundary. It matches the empty string at the edges of a sequence of word characters ([A-Za-z0-9_]). It is an optimized version of (?<!\w)(?=\w)|(?<=\w)(?!\w).
  • [A-Z-]*[0-9][A-Z0-9-]* matches strings of letters, digits and dashes, with at least one digit in them.
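
The two building blocks described above can be tried in isolation. The following is a minimal sketch; the test vectors and the variable names `token`, `has_digit_token` and `is_911` are illustrative and not part of the original answer.

```r
# Tokens made of letters, digits and dashes that contain at least one digit.
token <- '[A-Z-]*[0-9][A-Z0-9-]*'
x <- c('ASDF911', 'K-9564', 'OFFICER', 'E-COMMERCE')
has_digit_token <- grepl(token, x, perl = TRUE)
print(has_digit_token)  # TRUE TRUE FALSE FALSE

# \b only matches at the edge of a run of word characters, so the
# whitelist entry 911 is not recognised inside ASDF911 (no boundary
# between F and 9), but it is recognised after a space.
is_911 <- grepl('\\b911\\b', c('911', 'ASDF911', 'K-12 911'), perl = TRUE)
print(is_911)  # TRUE FALSE TRUE
```

This is why 'ASDF911' falls through to the third alternative and is removed as a whole, while a standalone 911 is captured and kept.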

http://ideone.com/E3TUU5
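
If K9 should be kept as well (the question lists it among the meaningful strings), one possible variation is to build the whitelist from a vector and add `K9\\b` to it. This is a sketch under that assumption; the names `keep`, `pattern` and `cleaned` are hypothetical and do not appear in the original answer.

```r
blah <- c('ASDF911 2346', 'E-COMMERCE', 'AMAZON E-COMMERCE', 'K-12 89752 911',
          '65426 -', 'TEACHERK-12', 'K9 OFFICER', 'WORK - K-9564',
          'DEVELOPER C++', ' C+ C +5', 'DEFAULT - 456')

# Whitelisted keywords as regex fragments. 'K9\\b' is the assumed addition;
# the other entries are taken from the answer's pattern.
keep <- c('911\\b', 'E-COMMERCE\\b', 'K-12\\b', 'K9\\b', 'C\\b[+]{0,2}')

# Same three alternatives as before: captured keyword, punctuation,
# or a letter/digit/dash token containing at least one digit.
pattern <- paste0('(?:\\b(', paste(keep, collapse = '|'),
                  ')|[[:punct:]]|[A-Z-]*[0-9][A-Z0-9-]*)')

# Replacing with \1 keeps captured keywords and deletes everything else matched.
cleaned <- gsub(pattern, '\\1', blah, perl = TRUE)
print(cleaned)
```

With this whitelist, 'K9 OFFICER' is left unchanged, while 'TEACHERK-12' and 'WORK - K-9564' are still reduced as in the output above. Keeping the keywords in a vector makes later additions a one-line change.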

Licensed under: CC-BY-SA with attribution