Breaking apart Character Vector into Individual Words in R

https://stackoverflow.com/questions/22845345

27-06-2023

Question

I have a character vector (vec) like this:

[1] "super good dental associates"   "cheap dentist in bel air md"    
    "dentures   "                    "dentures   "                    
    "in office teeth whitening"      "in office teeth whitening"      
    "dental gum surgery bel air, md"
[8] "dental implants"                "dental implants"                
    "veneer teeth pictures"

I need to break this apart into individual words. I tried this:

singleWords <- strsplit(vec, ' ')[[1]]

but I only get the split of the first element of that vector:

[1] "super"      "good"       "dental"     "associates"

How can I get a single vector of ALL the words as individual elements?

Solution

You could try:

strsplit(paste(vec, collapse = " "), ' ')[[1]]
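As a side note, the original attempt returned only the first element's words because strsplit() gives back a list with one character vector per input element, and [[1]] picks just the first of those. A minimal sketch of flattening that list instead is shown below. It assumes a small hypothetical sample standing in for vec, and it swaps the single-space separator for \\s+ (with trimws() on top), since splitting on a single ' ' can leave empty strings where the data has runs of spaces.

```r
# Hypothetical sample mirroring the question's data, including trailing spaces.
vec <- c("super good dental associates",
         "dentures   ",
         "dental gum surgery bel air, md")

# strsplit() is vectorised: it returns a list with one character vector
# per element of vec. trimws() drops leading/trailing whitespace first so
# that no empty string appears at the start of a split.
pieces <- strsplit(trimws(vec), "\\s+")

# unlist() flattens the list into a single character vector of words.
words <- unlist(pieces)
print(words)
```

With this approach there is no need to paste the elements together first, though the one-liner above works equally well when the strings are separated by exactly one space.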

Other suggestions

Just to confirm my comment, and since you mentioned it wasn't working, take a look. Since a couple of the elements have extra spaces, I would recommend splitting on the regex \\s+ (one or more whitespace characters) instead of the single space from my comment. Cheers.

> ( newVec <- unlist(sapply(vec, strsplit, "\\s+", USE.NAMES = FALSE)) )
# [1] "super"      "good"       "dental"     "associates" "cheap"      "dentist"   
# [7] "in"         "bel"        "air"        "md"         "dentures"   "dentures"  
#[13] "in"         "office"     "teeth"      "whitening"  "in"         "office"    
#[19] "teeth"      "whitening"  "dental"     "gum"        "surgery"    "bel"       
#[25] "air,"       "md"         "dental"     "implants"   "dental"     "implants"  
#[31] "veneer"     "teeth"      "pictures"

And since I see a stray comma in there, it might be a good idea to strip any remaining punctuation with a call to gsub:

> gsub("[[:punct:]]", "", newVec)
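Note that gsub() returns a new vector rather than modifying newVec in place, so the result has to be assigned to keep it. A short sketch of that step follows, using a hypothetical handful of words rather than the full output above. The nzchar() filter is an extra assumption on my part: it drops any element that becomes empty once its punctuation is removed.

```r
# Hypothetical subset of split words, including a punctuation-only token.
newVec <- c("bel", "air,", "md", "--", "dental")

# Remove every punctuation character and keep the result.
cleaned <- gsub("[[:punct:]]", "", newVec)

# Drop elements that are now empty strings (e.g. the former "--").
cleaned <- cleaned[nzchar(cleaned)]
print(cleaned)
```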

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow