Question

I have 1-minute data for an equity as follows:

                      bidopen    bidhigh    bidlow  bidclose bidvolume
currencypair
2007-03-30 16:01:00    1.9687    1.96900    1.9686    1.9686    877.40
2007-03-30 16:02:00    1.9686    1.96905    1.9686    1.9686    897.20
2007-03-30 16:03:00    1.9686    1.96900    1.9686    1.9690    1076.11
2007-03-30 16:04:00    1.9689    1.96910    1.9688    1.9690    849.70
2007-03-30 16:05:00    1.9690    1.96900    1.9688    1.9689    1402.80

I want to add an extra column. This column will:

  • take 15 records from this point onwards (including the current time)
  • from those 15 records get the maximum bidhigh and the minimum bidlow
  • calculate the difference of the high-low and use that value in the new column

I tried the following. Firstly I read the data in.

usecols = ['datetime','bidopen','bidhigh','bidlow','bidclose','bidvolume']
df=pd.read_csv(path,parse_dates=('datetime'),index_col=0, usecols = usecols )

Then I defined a function:

def lookaheadmaxmin(df):
    df2=df[:15]
    high=df2['bidhigh'].max(axis=1)
    low=df2['bidlow'].min(axis=1)
    return high-low

then

df['newcolumn'] = map( lookaheadmaxmin, df[:15])

This raises an error. I'm pretty sure the 'df[:15]' in the map is the problem, as I don't know how to pass a slice of the current and future records to the function.

Essentially, what I'm trying to do is determine how much price has moved within a 15-minute moving window, as follows:

16:00 - 16:15 - how much did price move? put this in the column on the 16:00 record

16:01 - 16:16 - how much did price move? put this in the column on the 16:01 record

16:02 - 16:17 - how much did price move? put this in the column on the 16:02 record

16:03 - 16:18 - how much did price move? put this in the column on the 16:03 record

16:04 - 16:19 - how much did price move? put this in the column on the 16:04 record

16:05 - 16:20 - how much did price move? put this in the column on the 16:05 record


Additional info:

I'm using Enthought Canopy Version 1.1.0 (64 bit) for Mac. Pandas version: Version: 0.12.0-1 (incorporating numpy 1.7.1)

Source data sample:

    currencypair,datetime,bidopen,bidhigh,bidlow,bidclose,askopen,askhigh,asklow,askclose,bidvolume,askvolume
    GBPUSD,2007-03-30 16:01:00,1.96870,1.96900,1.96860,1.96860,1.96850,1.96880,1.96845,1.96850,877.40,1386.70
    GBPUSD,2007-03-30 16:02:00,1.96860,1.96905,1.96860,1.96860,1.96850,1.96890,1.96840,1.96840,897.20,1272.30
    GBPUSD,2007-03-30 16:03:00,1.96860,1.96900,1.96860,1.96900,1.96850,1.96890,1.96840,1.96880,1076.11,1333.30
    GBPUSD,2007-03-30 16:04:00,1.96890,1.96910,1.96880,1.96900,1.96880,1.96890,1.96865,1.96880,849.70,765.10
    GBPUSD,2007-03-30 16:05:00,1.96900,1.96900,1.96880,1.96890,1.96875,1.96890,1.96860,1.96870,1402.80,1240.90
    GBPUSD,2007-03-30 16:06:00,1.96890,1.96890,1.96840,1.96860,1.96870,1.96870,1.96820,1.96850,769.50,1727.30
    GBPUSD,2007-03-30 16:07:00,1.96860,1.96880,1.96820,1.96830,1.96850,1.96870,1.96810,1.96820,842.00,1865.60
    GBPUSD,2007-03-30 16:08:00,1.96830,1.96930,1.96830,1.96910,1.96820,1.96920,1.96820,1.96890,1096.60,1197.70
    GBPUSD,2007-03-30 16:09:00,1.96910,1.96920,1.96880,1.96890,1.96895,1.96910,1.96865,1.96880,368.60,432.10

As a side note, there's something odd in the display of records (I'm using IPython notebook). Even though I leave the 'currencypair' column out of usecols, it shows up as the column heading. (I'm including this because I don't know whether it has a bearing on other things not working.)

Importing the data (using the read_csv call above; note that no 'currencypair' column is named):

usecols = ['datetime','bidopen','bidhigh','bidlow','bidclose','bidvolume']
df=pd.read_csv(path,parse_dates=('datetime'),index_col=0, usecols = usecols )

then doing

    df[:5]

shows: (note it shows 'currencypair' as the column heading but in df.info() below it just shows as 'index')

                           bidopen    bidhigh    bidlow    bidclose    bidvolume
    currencypair                    
    2007-03-30 16:01:00     1.9687     1.96900     1.9686     1.9686     877.40
    2007-03-30 16:02:00     1.9686     1.96905     1.9686     1.9686     897.20
    2007-03-30 16:03:00     1.9686     1.96900     1.9686     1.9690     1076.11
    2007-03-30 16:04:00     1.9689     1.96910     1.9688     1.9690     849.70
    2007-03-30 16:05:00     1.9690     1.96900     1.9688     1.9689     1402.80

df.info() shows:

    <class 'pandas.core.frame.DataFrame'>
    Index: 2362159 entries, 2007-03-30 16:01:00 to 2013-09-02 18:59:00
    Data columns (total 5 columns):
    bidopen      2362159  non-null values
    bidhigh      2362159  non-null values
    bidlow       2362159  non-null values
    bidclose     2362159  non-null values
    bidvolume    2362159  non-null values
    dtypes: float64(5)

Importing the data an alternative way

Importing and then removing the currencypair column (note the addition of 'currencypair' to usecols, then dropping the column afterwards):

    usecols = ['currencypair','datetime','bidopen','bidhigh','bidlow','bidclose','bidvolume']
    df=pd.read_csv(path,parse_dates=('datetime'),index_col=1, usecols = usecols )
    df=df.drop('currencypair',1)

shows:

                           bidopen    bidhigh    bidlow    bidclose    bidvolume
    datetime                    
    2007-03-30 16:01:00     1.9687     1.96900     1.9686     1.9686     877.40
    2007-03-30 16:02:00     1.9686     1.96905     1.9686     1.9686     897.20
    2007-03-30 16:03:00     1.9686     1.96900     1.9686     1.9690     1076.11
    2007-03-30 16:04:00     1.9689     1.96910     1.9688     1.9690     849.70
    2007-03-30 16:05:00     1.9690     1.96900     1.9688     1.9689     1402.80

and df.info() shows: (note index now shows as 'DatetimeIndex')

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 2362159 entries, 2007-03-30 16:01:00 to 2013-09-02 18:59:00
    Data columns (total 5 columns):
    bidopen      2362159  non-null values
    bidhigh      2362159  non-null values
    bidlow       2362159  non-null values
    bidclose     2362159  non-null values
    bidvolume    2362159  non-null values
    dtypes: float64(5)

Solution

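This is pretty straightforward when you only need a couple of specific columns, e.g. the max of a and the min of b. (The session below assumes DataFrame and date_range have been imported from pandas, and randn from numpy.random.)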

In [65]: df = DataFrame(randn(100,4),columns=list('abcd'),
        index=date_range('20130101 16:00',periods=100,freq='T'))

In [66]: df.head(20)
Out[66]: 
                            a         b         c         d
2013-01-01 16:00:00  0.404056  0.115774 -0.202356  0.998315
2013-01-01 16:01:00 -0.231966  0.262609  1.192302 -0.702163
2013-01-01 16:02:00 -0.467005  0.744724 -0.871782 -0.308637
2013-01-01 16:03:00 -0.175704  0.036244  1.404604 -0.106320
2013-01-01 16:04:00  0.046306 -0.098140  0.535573 -0.306300
2013-01-01 16:05:00 -0.115620 -1.069991  0.790965 -0.504283
2013-01-01 16:06:00  1.496555  0.373582  1.028092 -0.816990
2013-01-01 16:07:00  0.432081  0.182106  0.115107  1.239192
2013-01-01 16:08:00 -0.245789 -2.030840  0.118330 -1.922616
2013-01-01 16:09:00 -0.358188 -0.121750  1.768505 -2.096908
2013-01-01 16:10:00 -1.634722 -0.808355 -0.773417  0.095078
2013-01-01 16:11:00 -0.396295  0.168568 -0.901945 -0.073811
2013-01-01 16:12:00 -1.364391  2.052481 -0.175291  0.927363
2013-01-01 16:13:00 -0.523331  0.042475  0.361593 -2.239468
2013-01-01 16:14:00  1.573967 -0.709043  0.551812  0.452311
2013-01-01 16:15:00  0.180578  0.846856 -2.304107 -1.283507
2013-01-01 16:16:00  0.065386  0.356015 -0.174369  1.167562
2013-01-01 16:17:00 -1.747416  1.279114  0.559075  0.200927
2013-01-01 16:18:00 -2.041764 -0.085398  2.032789  0.195671
2013-01-01 16:19:00 -0.639329  0.268832  0.394621 -0.271260

Rolling functions compute over the window that ends at each row, so we time-shift the input (tshift only changes the index, not the values) so that each result aligns with the start point of its window rather than the end point:

In [67]: df['max_a'] = pd.rolling_max(df['a'].tshift(-14),15)

In [68]: df['min_b'] = pd.rolling_min(df['b'].tshift(-14),15)

In [69]: df.head(20)
Out[69]: 
                            a         b         c         d     max_a     min_b
2013-01-01 16:00:00  0.404056  0.115774 -0.202356  0.998315  1.573967 -2.030840
2013-01-01 16:01:00 -0.231966  0.262609  1.192302 -0.702163  1.573967 -2.030840
2013-01-01 16:02:00 -0.467005  0.744724 -0.871782 -0.308637  1.573967 -2.030840
2013-01-01 16:03:00 -0.175704  0.036244  1.404604 -0.106320  1.573967 -2.030840
2013-01-01 16:04:00  0.046306 -0.098140  0.535573 -0.306300  1.573967 -2.030840
2013-01-01 16:05:00 -0.115620 -1.069991  0.790965 -0.504283  1.573967 -2.030840
2013-01-01 16:06:00  1.496555  0.373582  1.028092 -0.816990  1.573967 -2.030840
2013-01-01 16:07:00  0.432081  0.182106  0.115107  1.239192  1.573967 -2.030840
2013-01-01 16:08:00 -0.245789 -2.030840  0.118330 -1.922616  1.573967 -2.030840
2013-01-01 16:09:00 -0.358188 -0.121750  1.768505 -2.096908  1.573967 -1.185540
2013-01-01 16:10:00 -1.634722 -0.808355 -0.773417  0.095078  1.573967 -1.185540
2013-01-01 16:11:00 -0.396295  0.168568 -0.901945 -0.073811  1.573967 -1.185540
2013-01-01 16:12:00 -1.364391  2.052481 -0.175291  0.927363  1.573967 -1.185540
2013-01-01 16:13:00 -0.523331  0.042475  0.361593 -2.239468  1.573967 -1.185540
2013-01-01 16:14:00  1.573967 -0.709043  0.551812  0.452311  1.573967 -1.185540
2013-01-01 16:15:00  0.180578  0.846856 -2.304107 -1.283507  1.266667 -1.185540
2013-01-01 16:16:00  0.065386  0.356015 -0.174369  1.167562  1.266667 -1.563288
2013-01-01 16:17:00 -1.747416  1.279114  0.559075  0.200927  1.266667 -1.563288
2013-01-01 16:18:00 -2.041764 -0.085398  2.032789  0.195671  1.266667 -1.810085
2013-01-01 16:19:00 -0.639329  0.268832  0.394621 -0.271260  1.266667 -1.810085

The high-low difference is then just

df['max_a'] - df['min_b']
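
For readers on newer pandas releases, where the top-level pd.rolling_max / pd.rolling_min functions and tshift are no longer available, the same idea can be sketched with the .rolling() method followed by a row shift. This is only a minimal sketch, not the code from the session above: the helper name forward_range and the small demo frame are made up for illustration, and it assumes the rows are evenly spaced (one per minute, no gaps), since it counts rows rather than minutes.

```python
import pandas as pd


def forward_range(df, window=15, high="bidhigh", low="bidlow"):
    """High-low range over the current row and the next window-1 rows.

    Hypothetical helper for illustration; assumes evenly spaced rows.
    """
    # rolling() labels each window by its LAST row, so shift the result
    # back by window-1 rows to label it by its FIRST row instead.
    fwd_high = df[high].rolling(window).max().shift(-(window - 1))
    fwd_low = df[low].rolling(window).min().shift(-(window - 1))
    return fwd_high - fwd_low


# Tiny made-up frame, using a 3-row window so the result is easy to check.
idx = pd.date_range("2007-03-30 16:01", periods=6, freq="min")
demo = pd.DataFrame(
    {
        "bidhigh": [5.0, 3.0, 4.0, 9.0, 2.0, 1.0],
        "bidlow": [1.0, 2.0, 3.0, 0.5, 1.5, 0.5],
    },
    index=idx,
)
demo["range3"] = forward_range(demo, window=3)
print(demo)
```

On the frame from the question this would be used as df['newcolumn'] = forward_range(df). The last window-1 rows come out as NaN, because a full forward window does not exist for them.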

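It seems you have gaps in your series; use asfreq to fill in the missing timestamps (as NaN rows), so that the index has a regular 1-minute frequency: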

In [16]: df = DataFrame(randn(10,2),columns=list('ab'),index=date_range('20130101 9:00',freq='T',periods=10))

In [17]: df
Out[17]: 
                            a         b
2013-01-01 09:00:00  0.516518 -1.497564
2013-01-01 09:01:00  1.747399  1.100530
2013-01-01 09:02:00 -0.223476 -0.682712
2013-01-01 09:03:00  0.343172 -0.341965
2013-01-01 09:04:00 -1.380057 -1.565732
2013-01-01 09:05:00 -2.156675  1.043532
2013-01-01 09:06:00 -1.237155 -0.219086
2013-01-01 09:07:00  1.626510 -0.596204
2013-01-01 09:08:00 -0.767588  0.496110
2013-01-01 09:09:00 -0.014556  0.012049

In [18]: df.index
Out[18]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:09:00]
Length: 10, Freq: T, Timezone: None

In [19]: df.append(Series(name=[Timestamp('20130101 09:15')]))
Out[19]: 
                            a         b
2013-01-01 09:00:00  0.516518 -1.497564
2013-01-01 09:01:00  1.747399  1.100530
2013-01-01 09:02:00 -0.223476 -0.682712
2013-01-01 09:03:00  0.343172 -0.341965
2013-01-01 09:04:00 -1.380057 -1.565732
2013-01-01 09:05:00 -2.156675  1.043532
2013-01-01 09:06:00 -1.237155 -0.219086
2013-01-01 09:07:00  1.626510 -0.596204
2013-01-01 09:08:00 -0.767588  0.496110
2013-01-01 09:09:00 -0.014556  0.012049
2013-01-01 09:15:00       NaN       NaN

In [20]: df.append(Series(name=[Timestamp('20130101 09:15')])).index
Out[20]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:15:00]
Length: 11, Freq: None, Timezone: None

In [21]: df.append(Series(name=[Timestamp('20130101 09:15')])).asfreq('T')
Out[21]: 
                            a         b
2013-01-01 09:00:00  0.516518 -1.497564
2013-01-01 09:01:00  1.747399  1.100530
2013-01-01 09:02:00 -0.223476 -0.682712
2013-01-01 09:03:00  0.343172 -0.341965
2013-01-01 09:04:00 -1.380057 -1.565732
2013-01-01 09:05:00 -2.156675  1.043532
2013-01-01 09:06:00 -1.237155 -0.219086
2013-01-01 09:07:00  1.626510 -0.596204
2013-01-01 09:08:00 -0.767588  0.496110
2013-01-01 09:09:00 -0.014556  0.012049
2013-01-01 09:10:00       NaN       NaN
2013-01-01 09:11:00       NaN       NaN
2013-01-01 09:12:00       NaN       NaN
2013-01-01 09:13:00       NaN       NaN
2013-01-01 09:14:00       NaN       NaN
2013-01-01 09:15:00       NaN       NaN

In [22]: df.append(Series(name=[Timestamp('20130101 09:15')])).asfreq('T').index
Out[22]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:15:00]
Length: 16, Freq: T, Timezone: None
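
As a smaller, self-contained illustration of the same asfreq step, here is a sketch that builds a frame with a gap directly rather than via append (which newer pandas releases no longer provide on DataFrame). The frame gappy and its values are made up for this example.

```python
import pandas as pd

# Made-up 1-minute data with 09:02 and 09:03 missing.
idx = pd.to_datetime(
    ["2013-01-01 09:00", "2013-01-01 09:01", "2013-01-01 09:04"]
)
gappy = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=idx)

# asfreq reindexes onto a regular 1-minute grid between the first and
# last timestamps; rows that did not exist are inserted as NaN.
regular = gappy.asfreq("min")
print(regular)
```

Note that, with default settings, a rolling max or min over a window containing one of these NaN rows yields NaN, so the gaps remain visible in the result instead of being silently skipped.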
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow