Can the frequency of a Pandas tseries DatetimeIndex be preserved when writing to an HDFStore?

https://stackoverflow.com/questions/23522023

17-07-2023

Question

I have a Pandas DataFrame whose index is as follows (note the Freq: H):

<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00, ..., 2013-12-31 23:00:00]
Length: 26304, Freq: H, Timezone: None

There are multiple columns, but the first few rows (and others scattered throughout) contain only NA entries. If I write this to an HDF file with:

hdfstore.put('/table', df, format='table', data_columns=True, append=False)

and then read it back with:

df = hdfstore['/table']

and look at the index, I see:

<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-11 04:00:00, ..., 2013-12-31 23:00:00]
Length: 24656, Freq: None, Timezone: None

Notice that the Freq is now None, that there are fewer rows, and that the start date-time is later. The first row is now the first row of the original DataFrame that contains at least one non-NA column value.

Firstly, is this expected behaviour due to limitations of the HDF5 format and how DataFrames are stored, or a bug?

Is there a clean way to avoid this, or do I just need to 'fix up' the index after loading? I'm not sure what the best way to do that would be either.

Solution

There is an option, introduced in 0.13.1 (it might have been 0.13.0), that lets you set dropna=False on a put/append to avoid dropping all-NaN rows. The rows are dropped for efficiency: most of the time, say when storing a Panel, you have lots of all-NaN rows and no reason to store them.
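A minimal sketch of the dropna=False write, assuming PyTables is installed. The function name roundtrip_keep_nan_rows, the column names and the 48-row frame are made up for illustration. The default for dropping all-NaN rows has differed between pandas versions, so passing dropna=False explicitly makes the intent clear either way.

```python
import numpy as np
import pandas as pd


def roundtrip_keep_nan_rows(path):
    """Write a frame with all-NaN rows to an HDF5 table and read it back."""
    # Two days of hourly data; pd.offsets.Hour() is the "H" frequency from the question.
    index = pd.date_range("2011-01-01", periods=48, freq=pd.offsets.Hour())
    df = pd.DataFrame({"a": np.arange(48.0), "b": np.arange(48.0)}, index=index)
    df.iloc[:10] = np.nan  # leading rows are entirely NaN, as in the question

    with pd.HDFStore(path, mode="w") as store:
        # dropna=False asks the table writer to keep rows whose values are all NaN.
        store.put("/table", df, format="table", data_columns=True, dropna=False)

    restored = pd.read_hdf(path, "/table")
    return df, restored
```

Calling roundtrip_keep_nan_rows("example.h5") (a hypothetical file name) should return a restored frame with the same number of rows and the same first timestamp as the original.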

With the rows kept, the frequency information will be preserved. Note, however, that the frequency information will NOT be preserved if you append multiple times.

You can always call pd.infer_freq(an_index) if you need to re-infer the frequency (where possible). Normally this is done automatically when needed.
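As a rough sketch of fixing up the index after a load, the snippet below simulates an index that has lost its frequency and restores it with pd.infer_freq. The frame, its "value" column and the variable names are assumptions for illustration; no HDF file is involved.

```python
import pandas as pd

# Hourly index; pd.offsets.Hour() is the "H" frequency shown in the question.
original = pd.date_range("2011-01-01", periods=72, freq=pd.offsets.Hour())

# Simulate a load that lost the frequency: rebuilding from raw values gives freq=None.
loaded = pd.DataFrame({"value": range(72)}, index=pd.DatetimeIndex(original.values))

# infer_freq returns a frequency string if the index is regular, or None if it is not.
inferred = pd.infer_freq(loaded.index)
if inferred is not None:
    # Rebuilding the index with an explicit freq validates it against the timestamps.
    loaded.index = pd.DatetimeIndex(loaded.index, freq=inferred)
```

If rows were dropped on write, the index has gaps and pd.infer_freq returns None, so this only helps once the all-NaN rows are kept.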

License: CC-BY-SA with attribution

Not affiliated with StackOverflow