How to trim leading and trailing whitespace?

https://stackoverflow.com/questions/2261079

20-09-2019
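Question

I am having some trouble with leading and trailing whitespace in a data.frame. For example, I would like to look at a specific row in a data.frame based on a certain condition: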

> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)] 

[1] codeHelper     country        dummyLI    dummyLMI       dummyUMI       
[6] dummyHInonOECD dummyHIOECD    dummyOECD      
<0 rows> (or 0-length row.names)

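I was wondering why I did not get the expected output, since the country Austria obviously exists in my data.frame. After looking through my code history and trying to figure out what went wrong, I tried: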

> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
   codeHelper  country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18        AUT Austria        0        0        0              0           1
   dummyOECD
18         1

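All I changed in the command was adding a space after Austria.

Further annoying problems obviously arise, for example when I want to merge two frames based on the country column. One data.frame uses "Austria " while the other has "Austria". The matching does not work.

Is there a good way to "show" the whitespace on my screen so that I am aware of the problem?
Can I remove leading and trailing whitespace in R?

So far I have been writing a simple Perl script to remove the whitespace, but it would be nice if I could somehow do it within R.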

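Solution

Probably the best way is to handle the trailing whitespace when you read your data file. If you use read.csv or read.table, you can set the parameter strip.white=TRUE.

If you want to clean the strings afterwards, you can use one of these functions: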
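As a rough illustration of the strip.white approach, the sketch below reads the same small CSV twice, once with the default settings and once with strip.white = TRUE. The inline CSV text and the names csv_text, raw and clean are made up for this example; the text= argument is used only to avoid needing a file on disk.

```r
# Hypothetical inline data: "Austria " has a trailing space, " Germany" a leading one
csv_text <- "code,country\nAUT,Austria \nDEU, Germany\n"

# Default: whitespace in unquoted character fields is kept
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# strip.white = TRUE removes leading and trailing whitespace from unquoted fields
clean <- read.csv(text = csv_text, strip.white = TRUE, stringsAsFactors = FALSE)

raw$country[1] == "Austria"    # FALSE, the trailing space is still there
clean$country[1] == "Austria"  # TRUE
```

Note that strip.white only affects unquoted fields, so values wrapped in quotes in the file may still need cleaning afterwards.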

# returns string w/o leading whitespace
trim.leading <- function (x)  sub("^\\s+", "", x)

# returns string w/o trailing whitespace
trim.trailing <- function (x) sub("\\s+$", "", x)

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

To use one of these functions on myDummy$country:

 myDummy$country <- trim(myDummy$country)

To "show" the whitespace you can use:

 paste(myDummy$country)

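This shows the strings surrounded by quotation marks ("), making whitespace easier to spot.

Other tips

As of R 3.2.0, a new function was introduced for removing leading/trailing whitespace: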
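Another way to make stray whitespace visible, sketched here with base R only, is to wrap each value in brackets and to flag the values that start or end with whitespace. The vector country and the name has_ws are invented for this example.

```r
# Hypothetical values: one with a trailing space, one clean, one with a leading space
country <- c("Austria ", "Austria", " Spain")

# Wrap each value in brackets so that the padding becomes visible
writeLines(sprintf("[%s]", country))

# Flag values that begin or end with a whitespace character
has_ws <- grepl("^\\s|\\s$", country)
country[has_ws]

# Character counts also give the padding away
nchar(country)
```

The logical vector has_ws can be used directly to subset a data.frame and inspect only the affected rows.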

trimws()

See: http://stat.ethz.ch/R-manual/R-patched/library/base/html/trimws.html

To manipulate whitespace, use str_trim() from the stringr package. The package manual, dated 15 February 2013, is on CRAN. The function can also handle string vectors.
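To connect trimws() back to the merge problem in the question, here is a minimal sketch. The two data frames left and right and their columns are hypothetical stand-ins for the asker's data; the which argument of trimws() selects the side to trim.

```r
# Hypothetical frames: the key in 'left' carries a trailing space
left  <- data.frame(country = c("Austria ", "Spain"), dummyOECD = c(1, 1),
                    stringsAsFactors = FALSE)
right <- data.frame(country = c("Austria", "Spain"), code = c("AUT", "ESP"),
                    stringsAsFactors = FALSE)

# Before trimming, only "Spain" matches
before <- merge(left, right, by = "country")

# Trim the key column, then merge again
left$country <- trimws(left$country)
after <- merge(left, right, by = "country")

nrow(before)  # 1
nrow(after)   # 2

# 'which' restricts trimming to one side: "both" (default), "left" or "right"
trimws("  x  ", which = "left")   # "x  "
trimws("  x  ", which = "right")  # "  x"
```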

install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)
d4$clean2<-str_trim(d4$V2)

(Credit goes to commenter: R. Cotton)

A simple function to remove leading and trailing whitespace:

trim <- function( x ) {
  gsub("(^[[:space:]]+|[[:space:]]+$)", "", x)
}

Usage:

> text = "   foo bar  baz 3 "
> trim(text)
[1] "foo bar  baz 3"

Regarding the first question: to see the whitespace, you can call print.data.frame directly with modified arguments:

print(head(iris), quote=TRUE)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
# 1        "5.1"       "3.5"        "1.4"       "0.2" "setosa"
# 2        "4.9"       "3.0"        "1.4"       "0.2" "setosa"
# 3        "4.7"       "3.2"        "1.3"       "0.2" "setosa"
# 4        "4.6"       "3.1"        "1.5"       "0.2" "setosa"
# 5        "5.0"       "3.6"        "1.4"       "0.2" "setosa"
# 6        "5.4"       "3.9"        "1.7"       "0.4" "setosa"

See also ?print.data.frame for other options.

Use grep or grepl to find observations with whitespace, and sub to remove it.

names<-c("Ganga Din\t","Shyam Lal","Bulbul ")
grep("[[:space:]]+$",names)
[1] 1 3
grepl("[[:space:]]+$",names)
[1]  TRUE FALSE  TRUE
sub("[[:space:]]+$","",names)
[1] "Ganga Din" "Shyam Lal" "Bulbul"

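I would have preferred to add this as a comment on user56's answer, but I am not yet able to, so I am writing it as a separate answer. Removing leading and trailing blanks can also be done with the trim() function from the gdata package: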

require(gdata)
example(trim)

Usage example:

> trim("   Remove leading and trailing blanks    ")
[1] "Remove leading and trailing blanks"

Another option is to use the stri_trim function from the stringi package, which by default removes leading and trailing whitespace:

> x <- c("  leading space","trailing space   ")
> stri_trim(x)
[1] "leading space"  "trailing space"

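To remove only leading whitespace, use stri_trim_left. To remove only trailing whitespace, use stri_trim_right. When you want to remove other leading or trailing characters, you have to specify them with pattern =.

See also ?stri_trim for more information.

A related problem occurs when there are multiple spaces between the words of the input: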

> a <- "  a string         with lots   of starting, inter   mediate and trailing   whitespace     "

You can then easily split this string into "real" tokens by passing a regular expression as the split argument:

> strsplit(a, split=" +")
[[1]]
 [1] ""           "a"          "string"     "with"       "lots"      
 [6] "of"         "starting,"  "inter"      "mediate"    "and"       
[11] "trailing"   "whitespace"

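Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is "", but if there is a match at the end of the string, the output is the same as with the match removed.

I created a trim.strings() function to trim leading and/or trailing whitespace: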
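One way to avoid the empty first element, sketched here with base R only, is to trim the ends and collapse the inner runs of whitespace before splitting. The names squished and tokens are introduced just for this example.

```r
a <- "  a string         with lots   of starting, inter   mediate and trailing   whitespace     "

# Trim both ends, then collapse every run of whitespace into a single space
squished <- gsub("\\s+", " ", trimws(a))

# Splitting on a single literal space now yields no empty elements
tokens <- strsplit(squished, " ", fixed = TRUE)[[1]]

tokens[1]       # "a"
length(tokens)  # 11
```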

# Arguments:    x - character vector
#            side - side(s) on which to remove whitespace 
#                   default : "both"
#                   possible values: c("both", "leading", "trailing")

trim.strings <- function(x, side = "both") {
  # fall back to "both" when side is not one of the recognised values
  if (is.na(match(side, c("both", "leading", "trailing")))) {
    side <- "both"
  }
  if (side == "leading") {
    sub("^\\s+", "", x)
  } else if (side == "trailing") {
    sub("\\s+$", "", x)
  } else {
    gsub("^\\s+|\\s+$", "", x)
  }
}

To illustrate:

a <- c("   ABC123 456    ", " ABC123DEF          ")

# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF" 

# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456    "      "ABC123DEF          "

# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] "   ABC123 456" " ABC123DEF"

The best method is trimws().

The following code applies this function to the entire data frame:

mydataframe <- data.frame(lapply(mydataframe, trimws), stringsAsFactors = FALSE)
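Because trimws() returns character vectors, applying it to every column also turns numeric columns into character. If that is not wanted, a possible variation is to trim only the character columns. The data frame df and the name is_chr below are hypothetical.

```r
# Hypothetical data: one padded character column and one numeric column
df <- data.frame(code = c(" AUT", "DEU "), gdp = c(1.5, 2.5),
                 stringsAsFactors = FALSE)

# Identify the character columns and trim only those
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], trimws)

df$code            # "AUT" "DEU"
is.numeric(df$gdp) # TRUE, the numeric column is untouched
```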

I tried trim(). It works with spaces as well as with “ ”. x = ' Harden, J. '

trim(x)

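myDummy$country <- as.character(myDummy$country)  # assigning a new value into a factor would give NA
myDummy$country[myDummy$country == "Austria "] <- "Austria"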

After that, you need to force R not to recognize "Austria " as a level. Let's assume you also have "USA" and "Spain" as levels:

myDummy$country = factor(myDummy$country, levels=c("Austria", "USA", "Spain"))

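This is a bit less intimidating than the highest-voted response, but it should still work.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow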
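For factor columns, another possibility is to trim the levels themselves, so that the levels do not have to be listed by hand. This is only a sketch with a made-up factor f; it relies on the fact that assigning duplicate names to levels() merges the corresponding levels.

```r
# Hypothetical factor in which "Austria " and "Austria" are distinct levels
f <- factor(c("Austria ", "USA", "Austria"))
nlevels(f)  # 3

# Trimming the levels makes the two Austria variants identical, and R merges them
levels(f) <- trimws(levels(f))

levels(f)        # "Austria" "USA"
as.character(f)  # "Austria" "USA" "Austria"
```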