I'm trying to determine (as my application is dealing with lots of data from different sources and different time zones, formats, etc) how best to store my data AND work with it.

For example, should I store everything as UTC? This means when I fetch data I need to determine what timezone it is currently in, and if it's NOT UTC, do the necessary conversion to make it so. (Note, I'm in EST).

Then, when performing computations on the data, should I extract it (say it's UTC) and convert it into MY time zone (EST), so it makes sense when I'm looking at it? Or should I keep it in UTC and do all my calculations that way?

A lot of this data is time series and will be graphed, and the graph will be in EST.

This is a Python project, so let's say I have a data structure like this:

"id1": {
    "interval": 60,                             # seconds (subDict['interval'])
    "last": "2013-01-29 02:11:11.151996+00:00"  # UTC (subDict['last'])
},

And I need to operate on this by determining whether the current time (now()) is greater than last + interval (i.e., have the 60 seconds elapsed)? In code:

import datetime

import dateutil.parser
from dateutil import tz

lastTime = dateutil.parser.parse(subDict['last'])
utcNow = datetime.datetime.utcnow().replace(tzinfo=tz.tzutc())

if lastTime + datetime.timedelta(seconds=subDict['interval']) < utcNow:
    print("Time elapsed, do something!")

Does that make sense? I'm working with UTC everywhere, both stored and computationally...

Also, if anyone has links to good write-ups on how to work with timestamps in software, I'd love to read them. Something like Joel on Software, but for timestamp usage in applications?


Solution

It seems to me as though you're already doing things 'the right way'. Users will probably expect to interact in their local time zone (input and output), but it's normal to store normalized dates in UTC format so that they are unambiguous and to simplify calculation. So, normalize to UTC as soon as possible, and localize as late as possible.

A small amount of information about Python and timezone processing can be found here:

My current preference is to store dates as Unix timestamp (tv_sec) values in backend storage, and convert them to Python datetime.datetime objects during processing. Processing is usually done with a datetime object in the UTC timezone, which is then converted to the local user's timezone just before output. I find that having a rich object such as a datetime.datetime helps with debugging.
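That pipeline (epoch seconds in storage, an aware datetime for processing, local time only at output) can be sketched with the stdlib alone; the `America/New_York` zone is just an illustrative choice, and the timestamp is the one from the question's example:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Backend storage holds a plain Unix timestamp (seconds since the epoch, UTC).
stored = 1359521160.62

# Processing: promote it to a rich, timezone-aware datetime in UTC.
dt_utc = datetime.fromtimestamp(stored, tz=timezone.utc)

# Output: localize as late as possible, just before display.
dt_local = dt_utc.astimezone(ZoneInfo("America/New_York"))
print(dt_utc.isoformat())
print(dt_local.isoformat())
```

Note that `dt_utc` and `dt_local` are the same instant; only the rendering differs.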

Time zones are a nuisance to deal with, and you probably need to decide on a case-by-case basis whether it's worth the effort to support them correctly.

For example, let's say you're calculating daily counts for bandwidth used. Some questions that may arise are:

  1. What happens on a daylight-saving boundary? Should you assume that a day is always 24 hours for ease of calculation, or must every daily calculation account for days that have more or fewer hours at the daylight-saving boundaries?
  2. When presenting a localized time, does it matter if a time is repeated? E.g., if you display an hourly report in local time without a time zone attached, will it confuse the user to have a missing hour of data, or a repeated hour of data, around daylight-saving changes?
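To make the first question concrete, here is a sketch using the stdlib `zoneinfo` module (Python 3.9+) and `America/New_York` as an example zone. The day of the 2021 US spring-forward transition only lasts 23 real hours, and the two ways of measuring it disagree:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

tz = ZoneInfo("America/New_York")

# Midnight before and after the 2021 US spring-forward transition (2021-03-14).
day_start = datetime(2021, 3, 14, tzinfo=tz)
day_end = datetime(2021, 3, 15, tzinfo=tz)

# Subtraction within the same zone is wall-clock arithmetic: it reports 24 hours.
print(day_end - day_start)  # 1 day, 0:00:00

# The real elapsed time between those two midnights is only 23 hours.
print((day_end.timestamp() - day_start.timestamp()) / 3600)  # 23.0
```

Which of the two answers is "right" depends on whether your daily counts mean calendar days or 24-hour windows.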

Other tips

Since, as far as I can see, you don't seem to be having any implementation problems, I would focus on design aspects rather than on code and timestamp format. I have experience participating in the design of network support for a navigation system implemented as a distributed system on a local network. The nature of that system is such that there is a lot of data (often conflicting) coming from different sources, so resolving possible conflicts and maintaining data integrity is rather tricky. Just some thoughts based on that experience.

Timestamping data, even in a distributed system spanning many computers, is usually not a problem, as long as you don't need a higher resolution than your system time functions provide, or better time synchronization accuracy than your OS components offer.

In the simplest case, using UTC is quite reasonable, and for most tasks it's enough. However, it's important to understand the purpose of timestamps in your system from the very beginning of the design. Time values (whether Unix time or formatted UTC strings) can sometimes be equal. If you have to resolve data conflicts based on timestamps (that is, always select the newer, or the older, value among several received from different sources), you need to decide whether an incorrectly resolved conflict (usually one that can be resolved in more than one way, because the timestamps are equal) is a fatal problem for your system design. The likely options are:

  1. If 99.99% of conflicts are resolved the same way on all nodes, you don't care about the remaining 0.01%, and they don't break data integrity. In that case you can safely continue using something like UTC.

  2. If strict resolution of all conflicts is a must for you, you have to design your own timestamping system. Timestamps may combine a time value (perhaps not system time, but some higher-resolution timer), a sequence number (to produce unique timestamps even when the time resolution alone is insufficient), and a node identifier (so that different nodes of your system generate completely unique timestamps).

  3. Finally, what you need may not be timestamps based on time at all. Do you really need to calculate the time difference between a pair of timestamps? Isn't it enough to be able to order timestamps, without connecting them to real moments in time? If you only need comparisons, not time calculations, timestamps based on sequential counters rather than real time are a good choice (see Lamport timestamps for details).
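A minimal sketch of option 3, a Lamport-style logical clock. The `(counter, node_id)` tuple makes timestamps totally ordered and unique across nodes, without any reference to real time; all names here are illustrative, not from any particular library:

```python
class LamportClock:
    """Logical clock: orders events causally without consulting real time."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event: advance the clock and return a unique timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_stamp):
        """Merge a timestamp received from another node, then advance."""
        remote_counter, _ = remote_stamp
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportClock("node-a"), LamportClock("node-b")
s1 = a.tick()       # (1, 'node-a')
s2 = b.observe(s1)  # (2, 'node-b') -- causally after s1
print(s1 < s2)      # True: plain tuple comparison gives a total order
```

The node id acts as a tie-breaker, so two nodes can never emit equal timestamps, which is exactly the conflict-resolution property discussed above.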

If you need strict conflict resolving, or if you need very high time resolution, you will probably have to write your own timestamp service.

Many ideas and clues can be borrowed from A. Tanenbaum's book, "Distributed Systems: Principles and Paradigms". When I faced such problems, it helped me a lot; it has a separate chapter dedicated to timestamp generation.

I think the best approach is to store all timestamp data as UTC. When you read it in, immediately convert to UTC; right before display, convert from UTC to your local time zone.

You might even want to have your code print all timestamps twice, once in local time and the second time in UTC time... it depends on how much data you need to fit on a screen at once.
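A sketch of that double display using only the stdlib; the local zone is whatever the OS reports, so the second line's offset will vary by machine:

```python
from datetime import datetime, timezone

now_utc = datetime.now(timezone.utc)  # keep the internal representation in UTC
now_local = now_utc.astimezone()      # same instant, rendered in the OS local zone

print("UTC:  ", now_utc.isoformat())
print("local:", now_local.isoformat())
```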

I am a big fan of the RFC 3339 timestamp format. It is unambiguous to both humans and machines. What is best about it is that almost nothing is optional, so it always looks the same:

2013-01-29T19:46:00.00-08:00
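As a sketch of how little code this takes today: since Python 3.7, the stdlib's `datetime.fromisoformat` parses this shape directly (a trailing `Z` and short fractional seconds are only accepted from 3.11 on), and `isoformat()` emits it:

```python
from datetime import datetime, timedelta

stamp = "2013-01-29T19:46:00-08:00"
dt = datetime.fromisoformat(stamp)  # timezone-aware; the -08:00 offset is preserved

print(dt.utcoffset() == timedelta(hours=-8))  # True
print(dt.isoformat())                         # 2013-01-29T19:46:00-08:00
```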

I prefer to convert timestamps to single float values for storage and computations, and then convert back to the datetime format for display. I wouldn't keep money in floats, but timestamp values are well within the precision of float values!

Working with time floats makes a lot of code very easy:

# time_now() is a helper returning the current time as a UTC float (e.g. time.time())
if time_now() >= last_time + interval:
    print("interval has elapsed")

It looks like you are already doing it pretty much this way, so I can't suggest any dramatic improvements.

I wrote some library functions to parse timestamps into Python time float values, and convert time float values back to timestamp strings. Maybe something in here will be useful to you:

http://home.blarg.net/~steveha/pyfeed.html

I suggest you look at feed.date.rfc3339. BSD license, so you can just use the code if you like.

EDIT: Question: How does this help with timezones?

Answer: If every timestamp you store is stored in UTC time as a Python time float value (number of seconds since the epoch, with optional fractional part), you can directly compare them; subtract one from another to find out the interval between them; etc. If you use RFC 3339 timestamps, then every timestamp string has the timezone right there in the timestamp string, and it can be correctly converted to UTC time by your code. If you convert from a float value to a timestamp string value right before displaying, the timezone will be correct for local time.
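That round trip (RFC 3339 string in, UTC float for storage and arithmetic, localized string out) is a few lines with the stdlib; the offset shown at the end depends on your system's local zone:

```python
from datetime import datetime, timezone

# Parse an RFC 3339 timestamp; the offset travels with the string.
raw = "2013-01-29T19:46:00-08:00"
as_float = datetime.fromisoformat(raw).timestamp()  # seconds since the epoch, UTC

# Floats compare and subtract directly.
later = as_float + 90.0
print(later - as_float)  # 90.0

# Convert back to a string only at display time, here in the system's local zone.
print(datetime.fromtimestamp(later, tz=timezone.utc).astimezone().isoformat())
```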

Also, as I said, it looks like the asker is already pretty much doing this, so I don't think I can give any amazing advice.

Personally, I'm using the Unix-time standard; it's very convenient for storage due to its simple representation, as it's merely a number. Since it internally represents UTC time, you have to make sure to generate it properly (converting from other timestamp formats) before storing, and then format it according to whatever time zone you want on output.

Once you have a common, tz-aware timestamp format in the backend data, plotting the data is very easy, as it's just a matter of setting the destination TZ.

As an example:

import datetime
import time

import pytz

# pre-encoded date as a Unix epoch timestamp
example = {"id1": {
                   "interval": 60,
                   "last": 1359521160.62
                   }
           }

# this prints the timestamp formatted in your system's local time zone
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(example['id1']['last'])))

# this uses an ISO country code to localize the timestamp; note that the naive
# datetime must be stamped as UTC first, then converted with astimezone() --
# calling brtz.localize() on it directly would mislabel the instant
countrytz = pytz.country_timezones['BR'][0]
brtz = pytz.timezone(countrytz)
utc_dt = pytz.utc.localize(datetime.datetime.utcfromtimestamp(example['id1']['last']))
print(utc_dt.astimezone(brtz))
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow