Creating an xts object results in altered timestamps

https://stackoverflow.com/questions/7341857

27-10-2019

Question

Suppose I have:

R> str(data)
'data.frame':   4 obs. of  2 variables:
 $ datetime: Factor w/ 4 levels "2011-01-05 09:30:00.001",..: 1 2 3 4
 $ price   : num  18.3 18.3 18.3 18.3

R> data
                 datetime price
1 2011-01-05 09:30:00.001 18.31
2 2011-01-05 09:30:00.321 18.33
3 2011-01-05 09:30:01.511 18.33
4 2011-01-05 09:30:02.192 18.34

When I try to load it into an xts object, the timestamps are subtly altered:

R> x <- xts(data[-1], as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS')))
R> str(x)
An ‘xts’ object from 2011-01-05 09:30:00.000 to 2011-01-05 09:30:02.191 containing:
  Data: num [1:4, 1] 18.3 18.3 18.3 18.3
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "price"
  Indexed by objects of class: [POSIXct,POSIXt] TZ: 
  xts Attributes:  
 NULL

 R> x
                         price
 2011-01-05 09:30:00.000 18.31
 2011-01-05 09:30:00.321 18.33
 2011-01-05 09:30:01.510 18.33
 2011-01-05 09:30:02.191 18.34

You'll notice that the timestamps have been altered. The first entry now occurs at 09:30:00.000 instead of 09:30:00.001, as the original data stated. The third and fourth rows are also wrong.

What is causing this? Am I doing something fundamentally wrong? I have tried various incantations to get the data into an xts object, and they all seem to exhibit this behavior.

EDIT: Adding sessionInfo()

R> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=C               LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.8-2 zoo_1.7-4

loaded via a namespace (and not attached):
[1] grid_2.13.1     lattice_0.19-30 tools_2.13.1

EDIT 2: If I change my source data to microsecond precision as follows:

datetime,price
2011-01-05 09:30:00.001000,18.31
2011-01-05 09:30:00.321000,18.33
2011-01-05 09:30:01.511000,18.33
2011-01-05 09:30:02.192000,18.34

And then load it, so that I have:

R> test
                    datetime price
1 2011-01-05 09:30:00.001000 18.31
2 2011-01-05 09:30:00.321000 18.33
3 2011-01-05 09:30:01.511000 18.33
4 2011-01-05 09:30:02.192000 18.34

And then, finally, convert it to an xts object and set the index format:

R> x <- xts(test[,-1], as.POSIXct(strptime(test$datetime, '%Y-%m-%d %H:%M:%OS')))
R> indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
R> x
                            [,1]
2011-01-05 09:30:00.000999 18.31
2011-01-05 09:30:00.321000 18.33
2011-01-05 09:30:01.510999 18.33
2011-01-05 09:30:02.191999 18.34

You can see the effect here as well. I had hoped that adding the extra precision would help, but unfortunately it does not.

EDIT 3: Please see @Dwin's answer for an end-to-end test case that reproduces this behavior.

EDIT 4: The behavior does not appear to be specific to milliseconds. The following shows the same altered result with a microsecond-resolution timestamp. If I change my input data to:

R> data
                    datetime price
1 2011-01-05 09:30:00.001001 18.31
2 2011-01-05 09:30:00.321001 18.33
3 2011-01-05 09:30:01.511001 18.33
4 2011-01-05 09:30:02.192005 18.34

And then create an xts object:

R> x <- xts(data[-1], 
            as.POSIXct(strptime(as.character(data$datetime), '%Y-%m-%d %H:%M:%OS')))
R> indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
R> x
                           price
2011-01-05 09:30:00.001000 18.31
2011-01-05 09:30:00.321001 18.33
2011-01-05 09:30:01.511001 18.33
2011-01-05 09:30:02.192004 18.34

EDIT 5: This appears to be a floating-point precision issue. Observe:

R> t <- as.POSIXct("2011-01-05 09:30:00.001001")
R> t
[1] "2011-01-05 09:30:00.001 CST"
R> as.numeric(t)
[1] 1294241400.0010008812

This shows the erroneous behavior and is consistent with the example in EDIT 4. However, using an example that did not show the error:

R> t <- as.POSIXct("2011-01-05 09:30:01.511001")
R> t
[1] "2011-01-05 09:30:01.511001 CST"
R> as.numeric(t)
[1] 1294241401.5110011101

It seems as if xts, or some underlying component, is rounding down rather than to the nearest value?

Solution

It seems the problem is only in the printing. Using the OP's original data:

ind <- as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS'))
as.numeric(ind)*1e6  # as expected
# [1] 1294241400001000 1294241400321000 1294241401511000 1294241402192000
ind  # wrong
# [1] "2011-01-05 09:30:00.000 CST" "2011-01-05 09:30:00.321 CST"
# [3] "2011-01-05 09:30:01.510 CST" "2011-01-05 09:30:02.191 CST"
x <- xts(data[-1], ind)
x  # wrong
#                         price
# 2011-01-05 09:30:00.000 18.31
# 2011-01-05 09:30:00.321 18.33
# 2011-01-05 09:30:01.510 18.33
# 2011-01-05 09:30:02.191 18.34
as.numeric(index(x))*1e6  # but the underlying index values are as expected
# [1] 1294241400001000 1294241400321000 1294241401511000 1294241402192000
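The same check can be reproduced in base R without xts. The following is a minimal sketch, not the answerer's code: it assumes `tz = "UTC"` (chosen here only so the result does not depend on the local time zone) and relies on the documented behavior that `%OSn` truncates fractional seconds on output. Adding half of the last displayed unit before formatting is one common way to get a rounded display; it changes only what is printed, not the stored index.

```r
# tz = "UTC" is an assumption made for this sketch.
ind <- as.POSIXct("2011-01-05 09:30:00.001", tz = "UTC")

# The stored double sits slightly below the intended millisecond.
frac <- as.numeric(ind) %% 1
sprintf("%.9f", frac)        # a value just under 0.001

# %OS3 truncates on output, so this value displays with .000 seconds.
format(ind, "%H:%M:%OS3")

# Shift by half a millisecond before formatting so that truncation
# behaves like round-to-nearest for display purposes.
shown <- format(ind + 0.0005, "%H:%M:%OS3")
shown                        # "09:30:00.001"
```

For six displayed digits, the corresponding shift would be 0.0000005.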

Other answers

You have your times in a factor:

R> str(data)
'data.frame':   4 obs. of  2 variables:
 $ datetime: Factor w/ 4 levels "2011-01-05 09:30:00.001",..: 1 2 3 4
 [...]

That is not the best place to start. You need to convert to character. So instead of

x <- xts(data[-1], as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS')))

I would suggest

x <- xts(data[-1], 
         order.by=as.POSIXct(strptime(as.character(data$datetime), 
                                      '%Y-%m-%d %H:%M:%OS')))

In my experience, the as.character() around a factor is critical. Factors are powerful for modeling, but they are a bit of a nuisance when you get them accidentally from reading data. Use stringsAsFactors=FALSE to your advantage and avoid them at data import.
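As a small illustration of that advice, the sketch below reads the OP's four rows from an inline string (the variable names `csv` and `dat` are made up for this example) and keeps the timestamps as character. Since R 4.0.0 the default is already `stringsAsFactors = FALSE`, but passing it explicitly keeps the behavior the same on older versions such as the 2.13.1 used in this thread.

```r
# Inline copy of the OP's data; in practice this would be a file path.
csv <- "datetime,price
2011-01-05 09:30:00.001,18.31
2011-01-05 09:30:00.321,18.33
2011-01-05 09:30:01.511,18.33
2011-01-05 09:30:02.192,18.34"

# Keep strings as character vectors instead of converting them to factors.
dat <- read.csv(text = csv, stringsAsFactors = FALSE)

str(dat$datetime)   # a character vector, so no as.character() is needed
```

As the rest of the thread shows, this avoids the factor pitfall but does not change how fractional seconds are printed.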

Edit: So this seems to point to the strptime/strftime implementations. To make matters more interesting, R takes some of this from the operating system and reimplements some of it in src/main/datetime.c.

Also, pay attention to the smallest epsilon you can add to a time variable and still have the two values seen as equal. On my 64-bit Linux system, this happens at 10^-7:

R> now <- Sys.time()
R> sapply(seq(1, 8), FUN=function(x) identical(now, now+1/10^x)) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
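The 10^-7 threshold follows from how doubles are spaced near current epoch values. The sketch below is an illustration under the assumption that times are stored as IEEE 754 doubles counting seconds since the epoch (as POSIXct does); the variable names are made up for this example. Near 1.29e9 adjacent doubles are 2^-22 (about 2.4e-7) apart, so an increment of 1e-7 is rounded away, and a value like x.001 is stored as the nearest grid point, which happens to lie just below it.

```r
x <- 1294241400                   # epoch seconds, close to the OP's timestamps

# Spacing between adjacent doubles around x (52 fraction bits).
ulp <- 2^(floor(log2(x)) - 52)
ulp                               # 2^-22, roughly 2.4e-7

lost <- (x + 1e-7) == x           # TRUE: less than half a spacing, rounded away
kept <- (x + 1e-6) == x           # FALSE: more than one spacing, survives

# What is actually stored when one millisecond is added.
stored <- (x + 0.001) - x
sprintf("%.12f", stored)          # slightly less than 0.001
```

Because the stored value is slightly under the intended millisecond, any formatter that truncates rather than rounds will show the previous millisecond.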

I am posting this only so that people who want to explore the issue have a reproducible example showing that it happens on more than just the OP's system. Applying as.character to the factor does not prevent it from occurring.

dat <- read.table(textConnection("     datetime\tprice
 1\t2011-01-05 09:30:00.001\t18.31
 2\t2011-01-05 09:30:00.321\t18.33
 3\t2011-01-05 09:30:01.511\t18.33
 4\t2011-01-05 09:30:02.192\t18.34"), header =TRUE, sep="\t")
 as.character(dat$datetime)
#[1] "2011-01-05 09:30:00.001" "2011-01-05 09:30:00.321" "2011-01-05 09:30:01.511"
#[4] "2011-01-05 09:30:02.192"
  strptime(as.character(dat$datetime),         '%Y-%m-%d %H:%M:%OS')
#[1] "2011-01-05 09:30:00" "2011-01-05 09:30:00" "2011-01-05 09:30:01"
#[4] "2011-01-05 09:30:02"
 as.POSIXct(strptime(as.character(dat$datetime), 
                                       '%Y-%m-%d %H:%M:%OS'))
#[1] "2011-01-05 09:30:00 EST" "2011-01-05 09:30:00 EST" "2011-01-05 09:30:01 EST"
#[4] "2011-01-05 09:30:02 EST"
 x <- xts(dat[-1], 
          order.by=as.POSIXct(strptime(as.character(dat$datetime), 
                                       '%Y-%m-%d %H:%M:%OS')))
 x
#                     price
# 2011-01-05 09:30:00 18.31
# 2011-01-05 09:30:00 18.33
# 2011-01-05 09:30:01 18.33
# 2011-01-05 09:30:02 18.34
 indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
 x
#                            price
# 2011-01-05 09:30:00.000999 18.31
# 2011-01-05 09:30:00.321000 18.33
# 2011-01-05 09:30:01.510999 18.33
# 2011-01-05 09:30:02.191999 18.34

sessionInfo()
R version 2.13.1 RC (2011-07-03 r56263)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      splines   stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] xts_0.8-2       zoo_1.7-4       sculpt3d_0.2-2  RGtk2_2.20.12  
 [5] rgl_0.92.798    survey_3.24     hexbin_1.26.0   spam_0.23-0    
 [9] xtable_1.5-6    polspline_1.1.5 Ryacas_0.2-10   XML_3.4-0      
[13] rms_3.3-1       Hmisc_3.8-3     survival_2.36-9 sos_1.3-0      
[17] brew_1.0-6      lattice_0.19-30

loaded via a namespace (and not attached):
[1] cluster_1.14.0 tools_2.13.1

Licensed under: CC-BY-SA with attribution
