I have a Date and want to represent it as an integer of the form yyyymm. Currently, I do:

get_year_month <- function(d) { return(as.integer(format(d, "%Y%m")))}
mydate = seq.Date(from = as.Date("2012-01-01"), to = as.Date("5012-01-01"), by = 1) 
system.time(ym <- get_year_month(mydate))
#    user  system elapsed 
#    5.972   0.974   6.951 

This is very slow for large datasets. Is there a faster way? Please provide timings for your answers so they can be easily compared. Use the above example.


Solution

Using functions from the lubridate package can be almost twice as fast as your function:

mydate = as.Date(rep("2012-01-01",1000))
library(lubridate)
library(microbenchmark)
microbenchmark(get_year_month(mydate),
               year(mydate)*100+month(mydate))

gives:

Unit: milliseconds
                               expr      min       lq   median       uq
             get_year_month(mydate) 2.150296 2.188370 2.218176 2.285973
 year(mydate) * 100 + month(mydate) 1.220016 1.228129 1.239704 1.284568

Other answers

You can try the yearmon class from the zoo package. Note that as.yearmon returns a yearmon object rather than an integer of the form yyyymm. In general, if you are doing time series manipulation and analysis, I would suggest using xts, or at least zoo. xts has a lot of functionality for analyzing very large time series.

Here is a quick benchmark against the other suggested solutions.

library(zoo)             # as.yearmon
library(lubridate)       # year, month
library(microbenchmark)

get_year_month <- function(d) {
    return(as.integer(format(d, "%Y%m")))
}
mydate = as.Date(rep("2012-01-01", 1e+06))

microbenchmark(get_year_month(mydate),
               year(mydate) * 100 + month(mydate),
               as.yearmon(mydate, format = "%Y-%m-%d"),
               times = 1)
## Unit: milliseconds
##                                     expr       min        lq    median        uq       max neval
##                   get_year_month(mydate) 1049.8813 1049.8813 1049.8813 1049.8813 1049.8813     1
##       year(mydate) * 100 + month(mydate)  434.1765  434.1765  434.1765  434.1765  434.1765     1
##  as.yearmon(mydate, format = "%Y-%m-%d")  249.6704  249.6704  249.6704  249.6704  249.6704     1

It would be best to keep your Dates in POSIXlt format if you want to manipulate them like that:

> system.time(ym <- get_year_month(mydate))
   user  system elapsed 
  4.039   0.025   4.079 
> system.time(mydatep <- as.POSIXlt(mydate))
   user  system elapsed 
  3.576   0.016   3.603 
> system.time(ym <- (1900 + mydatep$year)*100 + (mydatep$mon + 1))
   user  system elapsed 
  0.010   0.005   0.015 

Even counting the conversion, this is still a little faster than format(), and once you have the POSIXlt object, subsequent similar operations cost almost nothing.
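
The POSIXlt approach above can be wrapped in a small function. This is only a sketch: ym_from_posixlt is a hypothetical helper name, not something from the answers here. It relies on the documented POSIXlt fields, where $year counts years since 1900 and $mon is zero-based.

```r
# Sketch: yyyymm from POSIXlt fields instead of format().
# ym_from_posixlt is a hypothetical helper name used for illustration.
ym_from_posixlt <- function(d) {
  lt <- as.POSIXlt(d)  # one conversion; lt can be reused for other fields
  # $year is years since 1900, $mon is 0-11 (0 = January)
  as.integer((lt$year + 1900L) * 100L + lt$mon + 1L)
}

d <- as.Date(c("2012-01-01", "2012-12-31", "1999-07-15"))
ym <- ym_from_posixlt(d)
```

The as.integer() call makes the result an integer vector, matching what get_year_month returns. If you need several fields (year, month, day of week), converting once and keeping the POSIXlt object around is where this pays off.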

There may not be a faster way for a single item. For a collection, however, you do not need an explicit loop or replicate: format() and as.integer() are vectorized, so the function can be applied to the whole Date vector in one call, e.g.

get_year_month_all <- function(D) {
  # D is a Date vector; get_year_month is already vectorized over it
  x <- get_year_month(D)
  return(x)
}
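
If the collection contains many repeated dates, one option is to format only the distinct values and map the results back. This is a rough sketch, and get_year_month_unique is a hypothetical helper name. It will not help with the example in the question, where every date is distinct, but it may help with data like rep("2012-01-01", 1e+06).

```r
# Sketch: run the expensive format() step once per distinct date.
# get_year_month_unique is a hypothetical helper name used for illustration.
get_year_month_unique <- function(d) {
  days <- unclass(d)            # plain day counts since 1970-01-01
  u <- unique(days)             # distinct days only
  ym_u <- as.integer(format(as.Date(u, origin = "1970-01-01"), "%Y%m"))
  ym_u[match(days, u)]          # look up the result for each original element
}

d <- as.Date(c("2012-01-01", "2012-01-01", "2013-05-20", "2012-01-01"))
ym <- get_year_month_unique(d)
```

Matching on the unclassed day counts keeps the lookup a plain numeric match. Whether this beats the other approaches depends on how many duplicates the data has, so it is worth timing on your own data.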
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow