Question

I have a pandas DataFrame with a time series column. The years are shifted into the past, so I have to add a constant number of years to every element of that column.

The best way I found is to iterate through all the records and use

x.replace(year=x.year + years)  # x = current element, years = years to add

It is cythonized as shown below, but profiling shows it is still very slow:

cdef list _addYearsToTimestamps(list elts, int years):
    cdef cpdatetime x
    cdef int i
    for (i, x) in enumerate(elts):
        try:
            elts[i] = x.replace(year=x.year + years)
        except Exception as e:
            logError(None, "Cannot replace year of %s - leaving value as this: %s" % (str(x), repr(e)))
    return elts

def fixYear(data):
    data.loc[:, 'timestamp'] = _addYearsToTimestamps(list(data.loc[:, 'timestamp']), REAL_YEAR-(list(data[-1:]['timestamp'])[0].year))
    return data

I'm pretty sure there is a way to change the year without iterating, by using pandas's Timestamp features. Unfortunately, I can't find out how. Could someone elaborate?

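Solution

Create a pandas Timedelta object, then add it to the column with the += operator. Note that a Timedelta is a fixed duration, so 365 days is not exactly one calendar year whenever a leap day is crossed: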

x = pandas.Timedelta(days=365)
mydataframe.timestampcolumn += x

The key is to store your time series as pandas timestamps, so that the addition is applied to the whole column at once. To convert, use the pandas to_datetime function:

mydataframe['timestampcolumn'] = pandas.to_datetime(x['epoch'], unit='s')

This assumes the timestamps are stored as epoch seconds in the epoch column of a dataframe x (here x is a DataFrame, not the Timedelta from the previous snippet). That's not a requirement, of course; see the to_datetime documentation for converting other formats.
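As a minimal sketch of the two steps above, assuming a hypothetical dataframe x whose epoch column holds seconds since 1970 (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: 2000-01-01 and 2000-03-01 (UTC) as epoch seconds.
x = pd.DataFrame({"epoch": [946684800, 951868800]})

# Convert the epoch seconds to pandas timestamps (datetime64 values).
mydataframe = pd.DataFrame(
    {"timestampcolumn": pd.to_datetime(x["epoch"], unit="s")}
)

# Vectorized shift: the Timedelta is added to every row without a Python loop.
mydataframe["timestampcolumn"] += pd.Timedelta(days=365)

print(mydataframe)
```

Because 2000 is a leap year, the first value ends up on 2000-12-31 rather than 2001-01-01, while the second ends up on 2001-03-01. That is the fixed-duration behaviour of Timedelta at work.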

Other tips

Adapted from Pete's answer, here is an implementation of the solution, together with a demonstration.

#!/usr/bin/env python3

import random
import pandas
import time
import datetime

def getRandomDates(n):
    tsMin = time.mktime(time.strptime("1980-01-01 00:00:00", "%Y-%m-%d %H:%M:%S"))
    tsMax = time.mktime(time.strptime("2005-12-31 23:59:59", "%Y-%m-%d %H:%M:%S"))
    return pandas.Series([datetime.datetime.fromtimestamp(tsMin + random.random() * (tsMax - tsMin)) for x in range(0, n)])

def setMaxYear(tss, target):
    # Timestamp exposes .year directly (Timestamp.to_datetime() no longer exists in current pandas)
    maxYearBefore = tss.max().year
    # Timedelta cannot be expressed in years, so compute the number of days to add
    deltaDays = (datetime.date(target, 1, 1) - datetime.date(maxYearBefore, 1, 1)).days
    return tss + pandas.Timedelta(days=deltaDays)

data = pandas.DataFrame({'t1': getRandomDates(1000)})
data['t2'] = setMaxYear(data['t1'], 2015)
data['delta'] = data['t2'] - data['t1']
print(data)
print("delta min: %s" % str(min(data['delta'])))
print("delta max: %s" % str(max(data['delta'])))
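The day-based delta above shifts every row by the same duration, so the month and day of a timestamp can drift when leap days fall in between. If each value should instead keep its month and day and only change its year, one alternative (not part of the answers above) is pandas.DateOffset with the years argument. The sample dates and the years value below are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical sample timestamps, including a leap day.
ts = pd.Series(pd.to_datetime(["2003-06-15 12:00:00", "2004-02-29 08:30:00"]))

years = 11  # hypothetical number of calendar years to add

# DateOffset(years=...) shifts the calendar year of each element;
# a Feb 29 that has no counterpart in the target year is moved to Feb 28.
shifted = ts + pd.DateOffset(years=years)

print(shifted)
```

Unlike the Timedelta approach, the difference between the shifted and original values is not constant here, so which variant fits depends on whether equal spacing or matching calendar dates matters more.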
Licensed under: CC-BY-SA with attribution