문제

I have a indicator dataframe (1 to include data, 0 to not include) called indicator_df where the index is a timeseries and the columns are a MultiIndex like the following:

               Item0     Item1   
                A  D      A  C
2014-04-02      0  1      0  1
2014-04-03      0  1      0  1
2014-04-04      1  1      0  1

In addition I have a timeseries dataframe called data_df with the same index and the matching sub-columns

            A  B  C  D
2014-04-02  3  4  2 -3
2014-04-03  1  3 -2  1
2014-04-04 -1 -5  0 -2

What I'm looking for is a compact way to get a timeseries dataframe with the columns ['Item0', 'Item1'] where the each column is the sum of the data included by the indicator

new_df[col] = indicator_df[col].mul(data_df).sum(axis=1)

            Item0  Item1
2014-04-02     -3      2
2014-04-03      1     -2
2014-04-04     -3      0

I could just loop through the first level of the MultiIndex and concat each column, but I feel that I should be able to do this without a loop. Maybe with a clever groupby?

도움이 되었습니까?

해결책

So here's a less succinct version, but it's slightly more in the idiom of pandas:

First pandas.melt your data. It's easier to work with two DataFrames that are each just a collection of columns with some in common, than it is to try and do MultiIndex acrobatics.

In [127]: dfm = pd.melt(df, var_name=['items', 'labels'], id_vars=['index'], value_name='indicator')

In [128]: dfm
Out[128]:
        index  items labels  indicator
0  2014-04-02  Item0      A          0
1  2014-04-03  Item0      A          0
2  2014-04-04  Item0      A          1
3  2014-04-02  Item0      D          1
4  2014-04-03  Item0      D          1
5  2014-04-04  Item0      D          1
6  2014-04-02  Item1      A          0
7  2014-04-03  Item1      A          0
8  2014-04-04  Item1      A          0
9  2014-04-02  Item1      C          1
10 2014-04-03  Item1      C          1
11 2014-04-04  Item1      C          1

[12 rows x 4 columns]

In [129]: df2m = pd.melt(df2, var_name=['labels'], id_vars=['index'], value_name='value')

In [130]: df2m
Out[130]:
        index labels  value
0  2014-04-02      A      3
1  2014-04-03      A      1
2  2014-04-04      A     -1
3  2014-04-02      B      4
4  2014-04-03      B      3
5  2014-04-04      B     -5
6  2014-04-02      C      2
7  2014-04-03      C     -2
8  2014-04-04      C      0
9  2014-04-02      D     -3
10 2014-04-03      D      1
11 2014-04-04      D     -2

[12 rows x 3 columns]

Now you have two frames with some common columns ("labels" and "index") that you can then use in a pandas.merge:

In [140]: merged = pd.merge(dfm, df2m, on=['labels', 'index'], how='outer')

In [141]: merged
Out[141]:
        index  items labels  indicator  value
0  2014-04-02  Item0      A          0      3
1  2014-04-02  Item1      A          0      3
2  2014-04-03  Item0      A          0      1
3  2014-04-03  Item1      A          0      1
4  2014-04-04  Item0      A          1     -1
5  2014-04-04  Item1      A          0     -1
6  2014-04-02  Item0      D          1     -3
7  2014-04-03  Item0      D          1      1
8  2014-04-04  Item0      D          1     -2
9  2014-04-02  Item1      C          1      2
10 2014-04-03  Item1      C          1     -2
11 2014-04-04  Item1      C          1      0
12 2014-04-02    NaN      B        NaN      4
13 2014-04-03    NaN      B        NaN      3
14 2014-04-04    NaN      B        NaN     -5

[15 rows x 5 columns]

Since indicator is really just a boolean indexer, drop its NaNs and convert it to bool dtype

In [147]: merged.dropna(subset=['indicator'], inplace=True)

In [148]: merged['indicator'] = merged.indicator.copy().astype(bool)

In [149]: merged
Out[149]:
        index  items labels indicator  value
0  2014-04-02  Item0      A     False      3
1  2014-04-02  Item1      A     False      3
2  2014-04-03  Item0      A     False      1
3  2014-04-03  Item1      A     False      1
4  2014-04-04  Item0      A      True     -1
5  2014-04-04  Item1      A     False     -1
6  2014-04-02  Item0      D      True     -3
7  2014-04-03  Item0      D      True      1
8  2014-04-04  Item0      D      True     -2
9  2014-04-02  Item1      C      True      2
10 2014-04-03  Item1      C      True     -2
11 2014-04-04  Item1      C      True      0

[12 rows x 5 columns]

Now slice with indicator and use pivot_table to get your desired result:

In [150]: merged.loc[merged.indicator].pivot_table(values='value', index='index', columns=['items'], aggfunc=sum)
Out[150]:
items       Item0  Item1
index
2014-04-02     -3      2
2014-04-03      1     -2
2014-04-04     -3      0

[3 rows x 2 columns]

This may seem like a lot, but that might be because I'm writing out each step. It amounts to about five lines of code.

다른 팁

Would this be good enough for you, since the indices are indentical?

pd.DataFrame(np.array([(DF1[item]*DF2[DF[item].columns]).sum(axis=1) for item in ['Item0', 'Item1']]).T,
             columns=['Item0', 'Item1'], index=DF1.index)
라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top