How to trim leading and trailing whitespace?
-
20-09-2019 - |
Question
I am having some troubles with leading and trailing whitespace in a data.frame.
Eg I like to take a look at a specific row
in a data.frame
based on a certain condition:
> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)]
[1] codeHelper country dummyLI dummyLMI dummyUMI
[6] dummyHInonOECD dummyHIOECD dummyOECD
<0 rows> (or 0-length row.names)
I was wondering why I didn't get the expected output since the country Austria obviously existed in my data.frame
. After looking through my code history and trying to figure out what went wrong I tried:
> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
codeHelper country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18 AUT Austria 0 0 0 0 1
dummyOECD
18 1
All I have changed in the command is an additional whitespace after Austria.
Further annoying problems obviously arise. Eg when I like to merge two frames based on the country column. One data.frame
uses "Austria "
while the other frame has "Austria"
. The matching doesn't work.
- Is there a nice way to 'show' the whitespace on my screen so that i am aware of the problem?
- And can I remove the leading and trailing whitespace in R?
So far I used to write a simple Perl
script which removes the whitespace but it would be nice if I can somehow do it inside R.
Solution
Probably the best way is to handle the trailing whitespaces when you read your data file. If you use read.csv
or read.table
you can set the parameterstrip.white=TRUE
.
If you want to clean strings afterwards you could use one of these functions:
# returns string w/o leading whitespace
trim.leading <- function (x) sub("^\\s+", "", x)
# returns string w/o trailing whitespace
trim.trailing <- function (x) sub("\\s+$", "", x)
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
To use one of these functions on myDummy$country
:
myDummy$country <- trim(myDummy$country)
To 'show' the whitespace you could use:
paste(myDummy$country)
which will show you the strings surrounded by quotation marks (") making whitespaces easier to spot.
OTHER TIPS
As of R 3.2.0 a new function was introduced for removing leading/trailing whitespaces:
trimws()
See: http://stat.ethz.ch/R-manual/R-patched/library/base/html/trimws.html
To manipulate the white space, use str_trim() in the stringr package. The package has manual dated Feb 15,2013 and is in CRAN. The function can also handle string vectors.
install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)
d4$clean2<-str_trim(d4$V2)
(credit goes to commenter: R. Cotton)
A simple function to remove leading and trailing whitespace:
trim <- function( x ) {
gsub("(^[[:space:]]+|[[:space:]]+$)", "", x)
}
Usage:
> text = " foo bar baz 3 "
> trim(text)
[1] "foo bar baz 3"
ad1) To see white spaces you could directly call print.data.frame
with modified arguments:
print(head(iris), quote=TRUE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 "5.1" "3.5" "1.4" "0.2" "setosa"
# 2 "4.9" "3.0" "1.4" "0.2" "setosa"
# 3 "4.7" "3.2" "1.3" "0.2" "setosa"
# 4 "4.6" "3.1" "1.5" "0.2" "setosa"
# 5 "5.0" "3.6" "1.4" "0.2" "setosa"
# 6 "5.4" "3.9" "1.7" "0.4" "setosa"
See also ?print.data.frame
for other options.
Use grep or grepl to find observations with whitespaces and sub to get rid of them.
names<-c("Ganga Din\t","Shyam Lal","Bulbul ")
grep("[[:space:]]+$",names)
[1] 1 3
grepl("[[:space:]]+$",names)
[1] TRUE FALSE TRUE
sub("[[:space:]]+$","",names)
[1] "Ganga Din" "Shyam Lal" "Bulbul"
I'd prefer to add the answer as comment to user56 but yet unable so writing as an independent answer. Removing leading and trailing blanks might be achieved through trim() function from gdata package as well:
require(gdata)
example(trim)
Usage example:
> trim(" Remove leading and trailing blanks ")
[1] "Remove leading and trailing blanks"
Another option is to use the stri_trim
function from the stringi
package which defaults to removing leading and trailing whitespace:
> x <- c(" leading space","trailing space ")
> stri_trim(x)
[1] "leading space" "trailing space"
For only removing leading whitespace, use stri_trim_left
. For only removing trailing whitespace, use stri_trim_right
. When you want to remove other leading or trailing characters, you have to specify that with pattern =
.
See also ?stri_trim
for more info.
Another related problem occurs if you have multiple spaces inbetween inputs:
> a <- " a string with lots of starting, inter mediate and trailing whitespace "
You can then easily split this string into "real" tokens using a regular expression to the split
argument:
> strsplit(a, split=" +")
[[1]]
[1] "" "a" "string" "with" "lots"
[6] "of" "starting," "inter" "mediate" "and"
[11] "trailing" "whitespace"
Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
I created a trim.strings ()
function to trim leading and/or trailing whitespace as:
# Arguments: x - character vector
# side - side(s) on which to remove whitespace
# default : "both"
# possible values: c("both", "leading", "trailing")
trim.strings <- function(x, side = "both") {
if (is.na(match(side, c("both", "leading", "trailing")))) {
side <- "both"
}
if (side == "leading") {
sub("^\\s+", "", x)
} else {
if (side == "trailing") {
sub("\\s+$", "", x)
} else gsub("^\\s+|\\s+$", "", x)
}
}
For illustration,
a <- c(" ABC123 456 ", " ABC123DEF ")
# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF"
# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456 " "ABC123DEF "
# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] " ABC123 456" " ABC123DEF"
Best method is trimws()
Following code will apply this function to entire dataframe
mydataframe<- data.frame(lapply(mydataframe, trimws),stringsAsFactors = FALSE)
I tried trim(). Works well with white spaces as well as the '\n'. x = '\n Harden, J.\n '
trim(x)
myDummy[myDummy$country == "Austria "] <- "Austria"
After this, you'll need to force R not to recognize "Austria " as a level. Let's pretend you also have "USA" and "Spain" as levels:
myDummy$country = factor(myDummy$country, levels=c("Austria", "USA", "Spain"))
A little less intimidating than the highest voted response, but it should still work.