Question

I want to iterate over a dataframe's major axis date by date.

Example:

tdf = df.ix[date]

The issue I am having is that the type returned by df.ix changes, leaving me with 3 possible situations

  1. If the date does not exist in tdf an error is thrown: KeyError: 1394755200000000000

  2. If there is only one item in tdf: print type(tdf) returns <class 'pandas.core.series.Series'>

  3. If there is more than one item in tdf: print type(tdf) returns <class 'pandas.core.frame.DataFrame'>

To avoid the first case I can simply wrap this in a try catch block or thanks to jxstanford, I can avoid the try catch block by using if date in df.index:

I run into the issue afterwards with an inconsistent API with a pandas series and a pandas data frame. I could solve this by checking for types but it seems I shouldn't have to do that. I would ideally like to keep the types the same. Is there a better way of doing this?

I'm running pandas 0.13.1 and I am currently loading my data from a CSV using

Here's a full example demonstrating the problem.

from pandas import DataFrame
import datetime

path_to_csv = '/home/n/Documents/port/test.csv'

df = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')

start_dt = df.index.min()
end_dt = df.index.max()
dt_step = datetime.timedelta(days=1)

df.sort_index(inplace=True)

cur_dt = start_dt

while cur_dt != end_dt:
    if cur_dt in df.index:
       print type(df.ix[cur_dt])
    #run some other steps using cur_dt

    cur_dt += dt_step

An example CSV that demonstrates the problem is as follows:

value1,value2,value3,Date,type
1,2,4,03/13/14,a
2,3,3,03/21/14,b
3,4,2,03/21/14,a
4,5,1,03/27/14,b

The above code prints out

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

Is it possible to get the value of value1 from tdf in a consistent manner? or am I stuck making an if statement for and separately handle each case?

if type(df.ix[cur_dt]) == DataFrame:
    ....
if type(df.ix[cur_dt]) == Series:
    ....
Was it helpful?

Solution 2

This toy code will return DataFrames consistently.

def framer(rows):
    if ndim(rows) == 1:
        return rows.to_frame().T
    else:
        return rows

for cur_date in df.index:
    print type(framer(df.ix[cur_date]))

And this will give you the missing days:

df.resample(rule='D')

Have a look at the resample method docstring. It has its own options to fill up the missing data. And if you decide to make your multiple dates into a single one, the method you're looking at is groupby (if you want to combine values across rows) and drop_duplicates (if you want to ignore them). There is no need to reinvent the wheel.

OTHER TIPS

Not sure what your trying to do with the dataframe, but this might be better than a try/except:

tdf = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
while cur_dt != end_dt:
    if cur_dt in df.index:
       # do your thing

    cur_dt += dt_step

You can use the apply method of the DataFrame, using axis = 1 to work on each row of the DataFrame to build a Series with the same Index.

e.g.

def calculate_value(row):
   if row.date == pd.datetime(2014,3,21):
      return 0
   elif row.type == 'a':
      return row.value1 + row.value2 + row.value3
   else:
      return row.value1 * row.value2 * row.value3

df['date'] = df.index
df['NewValue'] = df.apply(calculate_value, axis=1)

modifies your example input as follows

            value1  value2  value3 type  NewValue       date
Date
2014-03-13       1       2       4    a         7 2014-03-13
2014-03-21       2       3       3    b         0 2014-03-21
2014-03-21       3       4       2    a         0 2014-03-21
2014-03-27       4       5       1    b        20 2014-03-27

[4 rows x 6 columns]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top