Question

I'm using the pandas Python library to compare two dataframes, each consisting of a column of dates and two columns of values. One of the dataframes, call it LongDF, covers more dates than the other, call it ShortDF. Both dataframes are indexed by date using pandas.tseries.index.DatetimeIndex. See below (I've shortened both just to demonstrate).

LongDF

╔════════════╦════════╦════════╗
║ Date       ║ Value1 ║ Value2 ║
╠════════════╬════════╬════════╣
║ 1990-03-17 ║ 6.84   ║ 1.77   ║
║ 1990-03-18 ║ 0.99   ║ 7.00   ║
║ 1990-03-19 ║ 4.90   ║ 8.48   ║
║ 1990-03-20 ║ 2.57   ║ 2.41   ║
║ 1990-03-21 ║ 4.10   ║ 8.33   ║
║ 1990-03-22 ║ 8.86   ║ 1.31   ║
║ 1990-03-23 ║ 6.01   ║ 6.22   ║
║ 1990-03-24 ║ 0.74   ║ 1.69   ║
║ 1990-03-25 ║ 5.56   ║ 7.30   ║
║ 1990-03-26 ║ 8.05   ║ 1.67   ║
║ 1990-03-27 ║ 8.87   ║ 8.22   ║
║ 1990-03-28 ║ 9.00   ║ 6.83   ║
║ 1990-03-29 ║ 1.34   ║ 6.00   ║
║ 1990-03-30 ║ 1.69   ║ 0.40   ║
║ 1990-03-31 ║ 8.71   ║ 3.26   ║
║ 1990-04-01 ║ 4.05   ║ 4.53   ║
║ 1990-04-02 ║ 9.75   ║ 4.79   ║
║ 1990-04-03 ║ 7.74   ║ 0.44   ║
╚════════════╩════════╩════════╝

ShortDF

╔════════════╦════════╦════════╗
║ Date       ║ Value1 ║ Value2 ║
╠════════════╬════════╬════════╣
║ 1990-03-25 ║ 1.98   ║ 3.92   ║
║ 1990-03-26 ║ 3.37   ║ 3.40   ║
║ 1990-03-27 ║ 2.93   ║ 7.93   ║
║ 1990-03-28 ║ 2.35   ║ 5.34   ║
║ 1990-03-29 ║ 1.41   ║ 7.62   ║
║ 1990-03-30 ║ 9.85   ║ 3.17   ║
║ 1990-03-31 ║ 9.95   ║ 0.35   ║
║ 1990-04-01 ║ 4.42   ║ 7.11   ║
║ 1990-04-02 ║ 1.33   ║ 6.47   ║
║ 1990-04-03 ║ 6.63   ║ 1.78   ║
╚════════════╩════════╩════════╝

What I'd like to do is reference the data occurring on the same day in each dataset, put data from both sets into one formula and, if it's greater than some number, paste the date and values into another dataframe.

I assume I should use something like for row in ShortDF.iterrows(): to iterate through each date on ShortDF but I can't figure out how to select the corresponding row on LongDF, using the DatetimeIndex.

Any help would be appreciated


Solution

OK, I'm awake now. Using your data (df1 is LongDF and df2 is ShortDF), you can do this:

In [425]:
# the key here is to tell merge to use the index on both sides
merged = df1.merge(df2,left_index=True, right_index=True)
# overlapping column names get _x (left) and _y (right) suffixes, this is fine
merged
Out[425]:
            Value1_x  Value2_x  Value1_y  Value2_y
Date                                              
1990-03-25      5.56      7.30      1.98      3.92
1990-03-26      8.05      1.67      3.37      3.40
1990-03-27      8.87      8.22      2.93      7.93
1990-03-28      9.00      6.83      2.35      5.34
1990-03-29      1.34      6.00      1.41      7.62
1990-03-30      1.69      0.40      9.85      3.17
1990-03-31      8.71      3.26      9.95      0.35
1990-04-01      4.05      4.53      4.42      7.11
1990-04-02      9.75      4.79      1.33      6.47
1990-04-03      7.74      0.44      6.63      1.78

[10 rows x 4 columns]
In [432]:
# now, using boolean indexing, keep just the rows containing a value larger than 9, then take the highest value in each of those rows
merged[merged.max(axis=1) > 9].max(axis=1)
Out[432]:
Date
1990-03-30    9.85
1990-03-31    9.95
1990-04-02    9.75
dtype: float64
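
If you want your own formula instead of the row maximum, the same merged frame can be filtered with a boolean mask. This is only a sketch: it assumes the frames are named LongDF and ShortDF as in the question, reproduces just a few of their rows, and uses a made-up formula (Value1 from the long frame plus Value2 from the short frame) and a made-up threshold. The suffixes argument simply replaces the default _x/_y names.

```python
import pandas as pd

# A few rows copied from the question's tables, indexed by date
LongDF = pd.DataFrame(
    {"Value1": [1.34, 1.69, 8.71, 4.05], "Value2": [6.00, 0.40, 3.26, 4.53]},
    index=pd.to_datetime(["1990-03-29", "1990-03-30", "1990-03-31", "1990-04-01"]),
)
ShortDF = pd.DataFrame(
    {"Value1": [9.85, 9.95, 4.42], "Value2": [3.17, 0.35, 7.11]},
    index=pd.to_datetime(["1990-03-30", "1990-03-31", "1990-04-01"]),
)

# Inner merge on both indexes keeps only the dates present in both frames
merged = LongDF.merge(ShortDF, left_index=True, right_index=True,
                      suffixes=("_long", "_short"))

# Hypothetical formula and threshold, stand-ins for whatever you actually need
threshold = 9
score = merged["Value1_long"] + merged["Value2_short"]

# Rows (date plus all four values) where the formula exceeds the threshold
selected = merged[score > threshold]
```

Here selected is a new dataframe, still indexed by date, so no explicit loop over rows is needed.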

Other suggestions

OK, so sometimes I like to think of pandas DataFrames as nothing more than dictionaries. This is because working with dictionaries is so easy and thinking of them like simple dicts often means you can find a solution to an issue without having to get too deep into pandas.

So in your example, say, I would just create a list of the common dates where the values in the DataFrames pass some test, and then create a new data frame using those dates to access the values in the existing data frames. In my example the test is whether Value1 in DF1 + Value1 in DF2 is greater than 10:

import pandas as pd
import random 
random.seed(123)

#Create some data
DF1 = pd.DataFrame({'Date'      :   ['1990-03-17', '1990-03-18', '1990-03-19', 
                                     '1990-03-20', '1990-03-21', '1990-03-22', 
                                     '1990-03-23', '1990-03-24', '1990-03-25', 
                                     '1990-03-26', '1990-03-27', '1990-03-28',
                                     '1990-03-29', '1990-03-30', '1990-03-31', 
                                     '1990-04-01', '1990-04-02', '1990-04-03'],
                    'Value1'    :   [round(random.uniform(1, 10), 2) 
                                     for _ in range(18)],
                    'Value2'    :   [round(random.uniform(1, 10), 2) 
                                     for _ in range(18)]
                   })

DF2 = pd.DataFrame({'Date'      :   ['1990-03-25', '1990-03-26', '1990-03-27',
                                     '1990-03-28', '1990-03-29', '1990-03-30',
                                     '1990-03-31', '1990-04-01', '1990-04-02',
                                     '1990-04-03'],
                    'Value1'    :   [round(random.uniform(1, 10), 2) 
                                     for _ in range(10)],
                    'Value2'    :   [round(random.uniform(1, 10), 2) 
                                     for _ in range(10)]
                   })

DF1.set_index('Date', inplace = True)
DF2.set_index('Date', inplace = True)

#Create a list of common dates where DF1.Value1 plus DF2.Value1
#is greater than 10
Common_Set = list(DF1.index.intersection(DF2.index))
Common_Dates = [date for date in Common_Set if
                DF1.Value1[date] + DF2.Value1[date] > 10]

#And now create the data frame I think you want using the Common_Dates

DF_Output = pd.DataFrame({'L_Value1' : [DF1.Value1[date] for date in Common_Dates],
                          'L_Value2' : [DF1.Value2[date] for date in Common_Dates],
                          'S_Value1' : [DF2.Value1[date] for date in Common_Dates],
                          'S_Value2' : [DF2.Value2[date] for date in Common_Dates]
                         }, index = Common_Dates)

This is definitely doable in pandas, as the comment suggests, but to me this is a simple solution. The Common_Dates operations could easily be done in one line, but I kept them separate for clarity.
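
For what it's worth, the shorter version of the Common_Dates step could look something like the sketch below. It relies on pandas aligning two Series on their index when they are added. The tiny frames here are made-up stand-ins (not the random data above) so that the snippet runs on its own.

```python
import pandas as pd

# Made-up stand-ins for DF1 and DF2, indexed by date strings as above
DF1 = pd.DataFrame({"Value1": [8.0, 2.0, 6.5], "Value2": [1.0, 2.0, 3.0]},
                   index=["1990-03-24", "1990-03-25", "1990-03-26"])
DF2 = pd.DataFrame({"Value1": [3.0, 4.0], "Value2": [5.0, 6.0]},
                   index=["1990-03-25", "1990-03-26"])

# Adding the Series aligns them on the index; a date missing from either
# side becomes NaN, and NaN > 10 is False, so it drops out of the result
total = DF1.Value1 + DF2.Value1
Common_Dates = list(total[total > 10].index)
```

With these stand-in values only 1990-03-26 passes the test (6.5 + 4.0), so Common_Dates holds that single date.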

Of course, it might be a massive pain to write out the DF_Output DataFrame constructor if you have lots of columns in both data frames. If that is the case then you could do this:

DF1_Out = {'L' + col : [DF1[col][date] for date in Common_Dates] 
            for col in DF1.columns}
DF2_Out = {'S' + col : [DF2[col][date] for date in Common_Dates] 
            for col in DF2.columns}

DF_Out = {}
DF_Out.update(DF1_Out)
DF_Out.update(DF2_Out)

DF_Output2 = pd.DataFrame(DF_Out, index = Common_Dates)

Both methods give me this (the first with L_Value1-style column names):

            LValue1  LValue2  SValue1  SValue2
1990-03-25     8.67     6.16     3.84     4.37
1990-03-27     4.03     8.54     7.92     7.79
1990-03-29     3.21     4.09     7.16     8.38
1990-03-31     4.93     2.86     7.00     6.92
1990-04-01     1.79     6.48     9.01     2.53
1990-04-02     6.38     5.74     5.38     4.03
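
If you would rather not build the dictionaries by hand, roughly the same wide frame can probably be put together with add_prefix and pd.concat. Again, this is a sketch on made-up stand-in data, and DF_Output3 is just a name chosen for illustration.

```python
import pandas as pd

# Made-up stand-ins for DF1, DF2 and the already-computed Common_Dates
DF1 = pd.DataFrame({"Value1": [8.0, 2.0, 6.5], "Value2": [1.0, 2.0, 3.0]},
                   index=["1990-03-24", "1990-03-25", "1990-03-26"])
DF2 = pd.DataFrame({"Value1": [3.0, 4.0], "Value2": [5.0, 6.0]},
                   index=["1990-03-25", "1990-03-26"])
Common_Dates = ["1990-03-26"]

# Select the common dates from each frame, prefix the column names,
# then glue the two pieces together side by side
DF_Output3 = pd.concat([DF1.loc[Common_Dates].add_prefix("L"),
                        DF2.loc[Common_Dates].add_prefix("S")], axis=1)
```

This scales to any number of columns, since the prefixes are applied to whatever columns each frame has.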

This won't satisfy a lot of people, I imagine, but it is the way I would tackle it. P.S. It would be great if you could do the legwork of creating the data frames in subsequent questions.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow