Question

I have a question about removing non-alphanumeric characters from a list in R. I have a list with all sorts of odd characters, blanks, etc. and would like to remove them. I'm generally able to remove what I want using the tm package in R. I fiddled around with it but got nowhere, so I thought going back to the list might be the place to start.

The list:

 list("\n    \n", "\n\n  ", "\n        ", "               ", "\n    ", 
 "\n            \n      ", "\n        ", "Home", "\n", "Expertise", 
 "Question & Research Design", "\n", "Survey Development & Validation", 
 "\n", "Data Processing", "\n", "Statistical Analysis", "\n", 
 "Publications & Grants", "\n", "Evaluation", "\n", "\n", 
 "Consulting Areas", "Business", "\n", "Education", "K-12", 
 "\n", "Â ", " Â Â  Â  Â", " | ")

The expected output

[1] ""                               ""                         ""
[4] ""                               ""                         ""
[7] ""                               "Home"                     ""
[10] "Expertise"                     "Question Research Design" ""
[13] "Survey Development Validation" ""                         "Data Processing"
[16] ""                              "Statistical Analysis"     ""
[19] "Publications Grants"           ""                         "Evaluation"
[22] ""                              ""                         "Consulting Areas"
[25] "Business"                      ""                         "Education"
[28] "K12"                           ""                         ""
[31] ""                              ""

Solution

I strongly recommend you simply use

gsub("[^a-zA-Z0-9]","",x)

where x is the name of the list.

You probably included the foreign characters at the end of the list because you want those removed too; the above command achieves this. To explain briefly, the square brackets in the pattern define a set of characters, and a leading ^ means "not", so every character that is not in the specified set of 62 (lowercase a to z, uppercase A to Z, and digits 0 to 9) is replaced by the empty string "" (i.e. deleted).

And here's the output...

 [1] ""                             ""                        ""
 [4] ""                             ""                        ""
 [7] ""                             "Home"                    ""
[10] "Expertise"                    "QuestionResearchDesign"  ""
[13] "SurveyDevelopmentValidation"  ""                        "DataProcessing"
[16] ""                             "StatisticalAnalysis"     ""
[19] "PublicationsGrants"           ""                        "Evaluation"
[22] ""                             ""                        "ConsultingAreas"
[25] "Business"                     ""                        "Education"
[28] "K12"                          ""                        ""
[31] ""                             ""
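
Note that the output above also drops the spaces between words, whereas the expected output in the question keeps a single space ("Question Research Design"). A minimal sketch of one way to get closer to that, assuming the input is a list of single strings: keep the space in the allowed set, then collapse runs of spaces and trim the ends. The helper name `clean` and the shortened sample list are illustrative only, not part of the original answer.

```r
# Shortened sample taken from the list in the question
x <- list("\n    \n", "Home", "Question & Research Design", "K-12", " | ")

# Hypothetical helper: keep ASCII letters, digits and spaces only
clean <- function(v) {
  v <- unlist(v)                     # list of strings -> character vector
  v <- gsub("[^a-zA-Z0-9 ]", "", v)  # drop everything except letters, digits, space
  v <- gsub(" +", " ", v)            # collapse runs of spaces into one
  trimws(v)                          # strip leading and trailing whitespace
}

result <- clean(x)
print(result)
```

With this variant, "K-12" still becomes "K12" and the whitespace-only entries still become empty strings, but multi-word entries keep their word boundaries.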

Other tips

I'm not sure if this gets rid of everything you want to remove, but ?regexp describes all sorts of interesting broad character classes you can use. For what you're describing, I think you want:

 gsub('[[:space:]|[:punct:]]+', '', yourlist)

Which gives:

 [1] ""                            ""                            ""                            ""                           
 [5] ""                            ""                            ""                            "Home"                       
 [9] ""                            "Expertise"                   "QuestionResearchDesign"      ""                           
[13] "SurveyDevelopmentValidation" ""                            "DataProcessing"              ""                           
[17] "StatisticalAnalysis"         ""                            "PublicationsGrants"          ""                           
[21] "Evaluation"                  ""                            ""                            "ConsultingAreas"            
[25] "Business"                    ""                            "Education"                   "K12"                        
[29] ""                            "Â"                           "ÂÂÂÂ"                        ""     
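
One detail about the pattern above: inside a bracket expression, `|` is a literal pipe character rather than an "or" operator. It does no harm here, because `|` already belongs to `[:punct:]`, so the pattern should behave the same with or without it. A small sketch to check this, using a shortened sample from the question (the variable names are illustrative):

```r
# Shortened sample taken from the list in the question
y <- c("Question & Research Design", "K-12", " | ", "\n    \n")

# Pattern as written in the answer, and the same class without the pipe
with_pipe    <- gsub("[[:space:]|[:punct:]]+", "", y)
without_pipe <- gsub("[[:space:][:punct:]]+", "", y)

print(with_pipe)
print(identical(with_pipe, without_pipe))

# Inside brackets, "|" is matched literally like any other character
literal_pipe <- gsub("[a|b]", "", "a|b|c")
print(literal_pipe)
```

As the output above shows, the "Â" characters survive this pattern because they are neither whitespace nor punctuation; the negated set `[^a-zA-Z0-9]` from the accepted solution is the one that removes them.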
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow