Is there a more elegant way to convert two-digit years to four-digit years with lubridate?
Pregunta
If a date vector has two-digit years, mdy()
turns years between 00 and 68 into 21st Century years and years between 69 and 99 into 20th Century years. For example:
library(lubridate)
mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))
gives the following output:
Multiple format matches with 5 successes: %m/%d/%y, %m/%d/%Y.
Using date format %m/%d/%y.
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC" "2004-01-02 UTC"
I can fix this after the fact by subtracting 100 from the incorrect dates to turn 2054 and 2068 into 1954 and 1968. But is there a more elegant and less error-prone method of parsing two-digit dates so that they get handled correctly in the parsing process itself?
Update: After @JoshuaUlrich pointed me to strptime
I found this question, which deals with an issue similar to mine, but using base R.
It seems like a nice addition to date handling in R would be some way to handle century selection cutoffs for two-digit dates within the date parsing functions.
Solución
Here is a function that allows you to do this:
library(lubridate)
x <- mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))
foo <- function(x, year=1968){
m <- year(x) %% 100
year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
x
}
Try it out:
x
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
foo(x)
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
foo(x, 1950)
[1] "1954-01-02 UTC" "1968-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"
The bit of magic here is to use the modulus operator %%
to return the fraction part of a division. So 1968 %% 100
yields 68.
Otros consejos
I just experienced this exact same bug / feature.
I ended up writing the following two quick functions to help convert from excel-type dates (which is where i get this most) to something R can use.
There's nothing wrong with the accepted answer -- it's just that i prefer not to load up on packages too much.
First, a helper to split and replace the years ...
year1900 <- function(dd_y, yrFlip = 50)
{
dd_y <- as.numeric(dd_y)
dd_y[dd_y > yrFlip] <- dd_y[dd_y > yrFlip] + 1900
dd_y[dd_y < yrFlip] <- dd_y[dd_y < yrFlip] + 2000
return(dd_y)
}
which is used by a function that 'fixes' your excel dates, depending on type:
XLdate <- function(Xd, type = 'b-Y')
{
switch(type,
'b-Y' = as.Date(paste0(substr(Xd, 5, 9), "-", substr(Xd, 1, 3), "-01"), format = "%Y-%b-%d"),
'b-y' = as.Date(paste0(year1900(substr(Xd, 5, 6)), "-", substr(Xd, 1, 3), "-01"),
format = "%Y-%b-%d"),
'Y-b' = as.Date(paste0(substr(Xd, 1, 3), "-", substr(Xd, 5, 9), "-01"), format = "%Y-%b-%d")
)
}
Hope this helps.
Another option would be:
xxx <- c("01-Jan-54","01-Feb-68","01-Aug-69","01-May-99","01-Jun-04", "
31-Dec-68","01-Jan-69", "31-Dec-99")
.
dmy(paste0(sub("\\d\\d$","",xxx) , ifelse( (tt <-
sub("\\d\\d-\\D\\D\\D-","",xxx) ) > 20 ,paste0("19",tt),paste0("20",tt))))
Though no solution is elegant nor short. I think it would be better if lubridate just added an option to specify the cutoff date.