Question

I have a text file that is several hundred rows long. I am trying to remove all punctuation characters from it except the "/" characters. I am currently using the strip function in the qdap package.

Here is a sample data set:

htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/", 
        "{fonttblf0fswissfcharset0 helvetica",
        "margl1440margr1440vieww9000viewh8400viewkind0")

Here is the code:

library(qdap)
strip(htxt, char.keep = "/", digit.remove = FALSE, apostrophe.remove = TRUE, lower.case = TRUE)

The only problem with this beautiful function is that it still removes the "/" characters. If I instead try to remove all punctuation except the "{" character, it works:

strip(htxt, char.keep = "{", digit.remove = FALSE, apostrophe.remove = TRUE, lower.case = TRUE)

Has anyone experienced the same problem?

Solution

For whatever reason, qdap's strip always removes "/" from character vectors. This line sits towards the end of the function's source code:

x <- clean(gsub("/", " ", gsub("-", " ", x)))

It runs before the inner helper (defined in the body of strip) that does the actual stripping, so every "/" has already been replaced by a space by the time char.keep is applied.
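To see the effect in isolation, here is a minimal base-R sketch that applies only the two nested gsub calls from the quoted line to the sample data. It leaves out qdap's clean() wrapper, and the variable name pre is just for illustration.

```r
htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/",
          "{fonttblf0fswissfcharset0 helvetica",
          "margl1440margr1440vieww9000viewh8400viewkind0")

# Same substitution as the quoted source line, minus the clean() wrapper
pre <- gsub("/", " ", gsub("-", " ", htxt))

# No "/" survives this step, whatever char.keep is set to
any(grepl("/", pre, fixed = TRUE))
#[1] FALSE
```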

So just define your own version of the function without that line (it still calls Trim from qdap, so the package needs to be loaded):

strip.new <- function (x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = TRUE, 
    lower.case = TRUE) 
{
    strp <- function(x, digit.remove, apostrophe.remove, char.keep, 
        lower.case) {
        if (!is.null(char.keep)) {
            x2 <- Trim(gsub(paste0(".*?($|'|", paste(paste0("\\", 
                char.keep), collapse = "|"), "|[^[:punct:]]).*?"), 
                "\\1", as.character(x)))
        }
        else {
            x2 <- Trim(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", 
                as.character(x)))
        }
        if (lower.case) {
            x2 <- tolower(x2)
        }
        if (apostrophe.remove) {
            x2 <- gsub("'", "", x2)
        }
        ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", 
            x2), x2)
    }
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
        apostrophe.remove = apostrophe.remove, char.keep = char.keep, 
        lower.case = lower.case))))
}

strip.new(htxt, char.keep = "/", digit.remove = FALSE, apostrophe.remove = TRUE, lower.case = TRUE)

#[1] "rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/"
#[2] "fonttblf0fswissfcharset0 helvetica"            
#[3] "margl1440margr1440vieww9000viewh8400viewkind0" 

The package author is pretty active on this site so he can probably clear up why strip does this by default.

Other tips

Why not:

> gsub("[^/]", "", htxt)
[1] "/" ""  "" 

Given the clarification by @SimonO101, the regex approach might be:

gsub("[]!\"#$%&'()*+,.:;<=>?@[^_`{|}~-]", "", htxt)

Note that the first item in that sequence is "]" and the last item is "-", and that the double quote needed to be escaped. This is the set targeted by [:punct:] with the "/" removed (the backslash is left out as well). To do it programmatically, you might use:

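rem.some.punct <- function(txt, notpunct = NULL) {
    # Punctuation to strip; "]" comes first and "-" last so both stay literal
    punct <- strsplit("]!\"#$%&'()*/+,.:;<=>?@[^_`{|}~-", "")[[1]]
    # Drop the characters to keep; setdiff compares literally, so regex
    # metacharacters such as "(" or "[" in notpunct are safe
    punct <- setdiff(punct, notpunct)
    gsub(paste0("[", paste(punct, collapse = ""), "]"), "", txt)
}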
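
As an alternative to spelling out the character class by hand, a negative lookahead can exclude "/" from [[:punct:]]. This is only a sketch in base R, assuming perl = TRUE is acceptable. It removes punctuation and nothing else, so it does not reproduce strip's lower-casing or digit handling. The name res is just for illustration.

```r
htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/",
          "{fonttblf0fswissfcharset0 helvetica",
          "margl1440margr1440vieww9000viewh8400viewkind0")

# (?!/) refuses to match at a "/", so [[:punct:]] only consumes other punctuation
res <- gsub("(?!/)[[:punct:]]", "", htxt, perl = TRUE)
res
#[1] "rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/"
#[2] "fonttblf0fswissfcharset0 helvetica"
#[3] "margl1440margr1440vieww9000viewh8400viewkind0"
```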

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow