Question

I am working on software that performs some time series manipulations. I recently discovered a serious issue in an R script I had developed; the unexpected behaviour was isolated to one specific machine whose time zone was set to Europe/Moscow. The issue boils down to the following snippet:

strange_days <- c("2/1/1984", "3/1/1984", "4/1/1984", "5/1/1984", "6/1/1984") 
Sys.setenv(TZ='Europe/Moscow')
d <- strptime(strange_days, '%m/%d/%Y')
d
[1] "1984-02-01 MSK" "1984-03-01 MSK" "1984-04-01"     "1984-05-01 MSD" "1984-06-01 MSD"

Everything seems to be recognized correctly. I assumed that, since this is daily data, the time zone attribute would not make much difference. That was a painful mistake:

as.numeric(d)
[1] 444430800 446936400        NA 452203200 454881600

which obviously fails afterwards during conversion to an xts object.

The current fix is to force all timezones to GMT via strptime(strange_days, '%m/%d/%Y', tz='GMT') or even Sys.setenv(TZ='GMT'); the issue is gone with that.

Is this good practice? Will the code be reliable in all situations? What techniques would you recommend to avoid similar problems?

And what was so special about the 1st of April 1984?

Edit: two related questions (this and this) indicate that daylight saving time is probably what causes the problem.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.0

Edit 2: the issue is clearly Windows-specific; it is not reproduced on Linux with these specs:

R version 3.1.0 (2014-04-10)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.0

Solution

In this case, since you're not interested in the time but only in the date you can use as.Date:

> as.Date(strange_days,"%m/%d/%Y")
[1] "1984-02-01" "1984-03-01" "1984-04-01" "1984-05-01" "1984-06-01"
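
As a minimal sketch of why this sidesteps the problem: a Date carries no time of day and no time zone, so the numeric value is just a day count and cannot fall into a DST gap. The variable names `dates` and `day_counts` are only illustrative.

```r
# Same input as in the question.
strange_days <- c("2/1/1984", "3/1/1984", "4/1/1984", "5/1/1984", "6/1/1984")

# Parse straight to Date: no time of day is involved, so the TZ setting
# does not influence the result.
dates <- as.Date(strange_days, format = "%m/%d/%Y")

# A Date is stored as the number of days since 1970-01-01.
day_counts <- as.numeric(dates)

# Sanity checks: no NA was produced, and the gaps between consecutive
# entries are the month lengths in days.
stopifnot(!anyNA(dates))
print(diff(day_counts))
```

The resulting Date vector should also be usable as the index when building the xts object, which avoids POSIXct for daily data altogether.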

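The error you're running into is (as you already noticed) most likely due to Daylight Saving Time: it so happens that DST in Russia in 1984 started on the first of April (source). As far as I can tell, the clocks went forward at midnight that year, so 00:00 local time on that date would not exist in Europe/Moscow, which would explain why only that element becomes NA.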
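
To see what the local clock does around that date on a given machine, one option is to start from unambiguous UTC instants and only format them in Moscow time. This is a sketch, not a definitive test: the starting instant assumes Moscow standard time was UTC+3 in 1984, and the printed labels depend on the time zone database your platform and R build use.

```r
# One minute before local midnight on 1984-04-01, expressed in UTC
# (assumption: Moscow standard time was UTC+3, so 20:59 UTC is 23:59 local).
before <- as.POSIXct("1984-03-31 20:59:00", tz = "UTC")

# Three consecutive minutes as absolute instants; these always exist.
instants <- before + c(0, 60, 120)

# Render the same instants as Moscow wall-clock time.
local_labels <- format(instants, tz = "Europe/Moscow", usetz = TRUE)
print(local_labels)
```

If the labels skip an hour between the first and second element, midnight itself is a wall-clock time that never occurred, and parsing "4/1/1984" without a time lands exactly on it.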

That being said, on Mac OS X 10.7.5 running R 2.14.2 (yes, a little outdated) this error is not reproducible:

> strange_days <- c("2/1/1984", "3/1/1984", "4/1/1984", "5/1/1984", "6/1/1984") 
> Sys.setenv(TZ='Europe/Moscow')
> d <- strptime(strange_days, '%m/%d/%Y')
> d
[1] "1984-02-01 MSK" "1984-03-01 MSK" "1984-04-01 MSD" "1984-05-01 MSD" "1984-06-01 MSD"
> as.numeric(d)
[1] 444430800 446936400 449611200 452203200 454881600

This suggests that one of the changes made to strptime between R versions 2.14.2 and 3.1.0 modified this behaviour. I'm currently looking for it in the changelogs, but I have no definitive evidence yet. Another possibility is that it is platform-specific.

Additionally here is an excerpt from ?strptime:

Remember that in most timezones some times do not occur and some occur twice because of transitions to/from summer time. strptime does not validate such times (it does not assume a specific timezone), but conversion by as.POSIXct will do so. Conversion by strftime and formatting/printing uses OS facilities and may (and does on Windows) return nonsensical results for non-existent times at DST transitions.
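
Based on that excerpt, one way to catch such values early is to run the as.POSIXct conversion yourself and look for NA before the data goes anywhere else. The helper below is hypothetical (`find_invalid_times` is not part of base R) and only a sketch: it also flags strings that strptime could not parse at all, and whether a non-existent local time becomes NA or gets shifted depends on the platform, as the two outputs in this thread show.

```r
# Hypothetical helper (not part of base R): return the input strings whose
# parsed local time does not convert to a valid instant in time zone `tz`.
find_invalid_times <- function(x, format, tz) {
  parsed <- strptime(x, format, tz = tz)  # POSIXlt; times are not validated here
  converted <- as.POSIXct(parsed)         # validating conversion; NA signals trouble
  x[is.na(converted)]
}

strange_days <- c("2/1/1984", "3/1/1984", "4/1/1984", "5/1/1984", "6/1/1984")

# UTC has no DST transitions, so nothing should be flagged.
bad_utc <- find_invalid_times(strange_days, "%m/%d/%Y", tz = "UTC")
print(bad_utc)

# In Europe/Moscow the result is platform-dependent: on the Windows setup
# from the question "4/1/1984" would be flagged, on others it may not be.
bad_msk <- find_invalid_times(strange_days, "%m/%d/%Y", tz = "Europe/Moscow")
print(bad_msk)
```

Running a check like this right after parsing turns a silent NA into something you can report or stop on before the xts conversion.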

License: CC-BY-SA with attribution