Question

I am working on extracting data from text using the 'stringr' package in R. I found this example:

library(stringr)

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
"387 287 6718", "apple", "233.398.9187 ", "482 952 3315",
"239 923 8115", "842 566 4692", "Work: 579-499-7527", "$1000",
"Home: 543.355.3679")
pattern <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_extract(strings, pattern)
str_extract_all(strings, pattern)

However, my strings are in the format below:

strings <- c("87225324","65-62983211","65-6298-3211","8722 5324","(65) 6296-2995","(65) 6660 8060","(65) 64368308","+65 9022 7744","+65 6296-2995","+65-6427 8436","+65 6357 3323/322")

But I am not sure what pattern would extract all of the formats above. Any help would be great.


Solution

The code below covers the cases in your question. Hopefully, you can generalize it if you find other character combinations in the data.

# Phone numbers (I've added an additional number with the "/" character)
strings <- c("87225324","65-62983211","65-6298-3211","8722 5324",
           "(65) 6296-2995","(65) 6660 8060","(65) 64368308","+65 9022 7744",
           "+65 6296-2995","+65-6427 8436","+65 6357 3323/322", "+65 4382 6922/6921")

# Remove all non-numeric characters except "/" (your string doesn't include any
# text like "Work:" or "Home:", but I included a regex to deal with those cases
# as well)
strings.cleaned = gsub("[- .)(+]|[a-zA-Z]*:?","", strings)

# If you're sure there are no other non-numeric characters you need to deal with 
# separately, then you can also do the following instead of the code above: 
# gsub("[^0-9/]","", strings). This regex matches any character that's not 
# a digit or "/".

strings.cleaned
 [1] "87225324"       "6562983211"     "6562983211"     "87225324"       "6562962995"    
 [6] "6566608060"     "6564368308"     "6590227744"     "6562962995"     "6564278436"    
[11] "6563573323/322" "6543826922/6921"

# Separate string vector into the cleaned strings and the two "special cases" that we 
# need to deal with separately
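has.slash = grepl("/", strings.cleaned)
special.cases = strings.cleaned[has.slash]
# Use a logical index rather than -grep(): if no element contained "/", 
# x[-grep(...)] would return an empty vector instead of leaving x unchanged
strings.cleaned = strings.cleaned[!has.slash]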

# Split each phone number with a "/" into two phone numbers
special.cases = unlist(lapply(strsplit(special.cases, "/"), 
                          function(x) {
                            c(x[1], 
                            paste0(substr(x[1], 1, nchar(x[1]) - nchar(x[2])), x[2]))
                          }))
special.cases
[1] "6563573323" "6563573322" "6543826922" "6543826921"

# Put the special.cases back with strings.cleaned
strings.cleaned = c(strings.cleaned, special.cases)

# Select last 8 digits from each phone number
phone.nums = as.numeric(substr(strings.cleaned, nchar(strings.cleaned) - 7, 
                                                nchar(strings.cleaned)))
phone.nums
 [1] 87225324 62983211 62983211 87225324 62962995 66608060 64368308 90227744 62962995 64278436
[11] 63573323 63573322 43826922 43826921
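
The steps above can also be collected into one helper. This is only a minimal sketch in base R: the function name extract_local_numbers and the n_digits argument are hypothetical names introduced here, not part of the answer. It assumes, as the answer does, that the local number is the last 8 digits and that a "/" suffix replaces the trailing digits of the number before it. Only the first suffix after a "/" is handled.

```r
# Hypothetical helper (name and arguments are illustrative, not from the answer)
extract_local_numbers <- function(x, n_digits = 8) {
  # Keep only digits and "/" (the simpler regex from the answer's comment)
  cleaned <- gsub("[^0-9/]", "", x)

  # Expand "main/suffix" into two full numbers; plain numbers pass through
  expanded <- unlist(lapply(strsplit(cleaned, "/", fixed = TRUE), function(parts) {
    if (length(parts) < 2) return(parts)
    main <- parts[1]
    alt <- parts[2]
    c(main, paste0(substr(main, 1, nchar(main) - nchar(alt)), alt))
  }))

  # Keep the last n_digits characters of each number, as character strings
  substr(expanded, pmax(nchar(expanded) - n_digits + 1, 1), nchar(expanded))
}

strings <- c("87225324", "(65) 6296-2995", "+65 6357 3323/322")
extract_local_numbers(strings)
# [1] "87225324" "62962995" "63573323" "63573322"
```

Unlike the code above, this sketch keeps the expanded numbers in their original position instead of moving them to the end, and it returns character strings rather than numerics, so a number with a leading zero would keep it.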

OTHER TIPS

The pattern argument accepts any regular expression. For instance, calling str_extract_all(strings, pattern) with the regular expression "[0-9]" (which matches a single digit) as the pattern returns a list with one element per element of strings, each holding the digits found in that string as separate characters. Other examples of regular expressions may be found here: https://docs.python.org/2/library/re.html.

This is what is returned for your strings vector when using "[0-9]" as the regular expression:

str_extract_all(strings,"[0-9]")

[[1]]
[1] "8" "7" "2" "2" "5" "3" "2" "4"
[[2]]
[1] "6" "5" "6" "2" "9" "8" "3" "2" "1" "1"
[[3]]
[1] "6" "5" "6" "2" "9" "8" "3" "2" "1" "1"
[[4]]
[1] "8" "7" "2" "2" "5" "3" "2" "4"
[[5]]
[1] "6" "5" "6" "2" "9" "6" "2" "9" "9" "5"
[[6]]
[1] "6" "5" "6" "6" "6" "0" "8" "0" "6" "0"
[[7]]
[1] "6" "5" "6" "4" "3" "6" "8" "3" "0" "8"
[[8]]
[1] "6" "5" "9" "0" "2" "2" "7" "7" "4" "4"
[[9]]
[1] "6" "5" "6" "2" "9" "6" "2" "9" "9" "5"
[[10]]
[1] "6" "5" "6" "4" "2" "7" "8" "4" "3" "6"
[[11]]
[1] "6" "5" "6" "3" "5" "7" "3" "3" "2" "3" "3" "2" "2"
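
Because "[0-9]" matches one digit at a time, each element above comes back split into single characters. Here is a small sketch of two variations, assuming the stringr package is installed (the variable names digit_runs and digits_only are just illustrative). Adding the + quantifier returns runs of consecutive digits, and str_replace_all() with "[^0-9]" removes every non-digit so that each string becomes a single run of digits.

```r
library(stringr)

strings <- c("65-6298-3211", "(65) 6660 8060", "+65 6357 3323/322")

# "[0-9]+" matches runs of consecutive digits rather than single digits
digit_runs <- str_extract_all(strings, "[0-9]+")
digit_runs[[1]]
# [1] "65"   "6298" "3211"

# Drop every non-digit character to get one digit string per element
digits_only <- str_replace_all(strings, "[^0-9]", "")
digits_only
# [1] "6562983211"    "6566608060"    "6563573323322"
```

Note that in the last element the "/322" suffix is simply glued onto the number, so entries containing "/" still need separate handling.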
Licensed under: CC-BY-SA with attribution