I'm new into pandas and try to learn how to process my multi-dimensional data.
My data
Let's assume, my data is a big CSV of the columns ['A', 'B', 'C', 'D', 'E', 'F', 'G']. This data describes some simulation results, where ['A', 'B', ..., 'F'] are simulation parameters and 'G' is one of the ouputs (only existing output in this example!).
EDIT / UPDATE:
As Boud suggested in the comments, let's generate some data which is compatible to mine:
import pandas as pd
import itertools
import numpy as np
npData = np.zeros(5000, dtype=[('A','i4'),('B','f4'),('C','i4'), ('D', 'i4'), ('E', 'f4'), ('F', 'i4'), ('G', 'f4')])
A = [0,1,2,3,6] # param A: int
B = [1000.0, 10.000] # param B: float
C = [100,150,200,250,300] # param C: int
D = [10,15,20,25,30] # param D: int
E = [0.1, 0.3] # param E: float
F = [0,1,2,3,4,5,6,7,8,9] # param F = random-seed = int -> 10 runs per scenario
# some beta-distribution parameters for randomizing the results in column "G"
aDistParams = [ (6,1),
(5,2),
(4,3),
(3,4),
(2,5),
(1,6),
(1,7) ]
counter = 0
for i in itertools.product(A,B,C,D,E,F):
npData[counter]['A'] = i[0]
npData[counter]['B'] = i[1]
npData[counter]['C'] = i[2]
npData[counter]['D'] = i[3]
npData[counter]['E'] = i[4]
npData[counter]['F'] = i[5]
np.random.seed = i[5]
npData[counter]['G'] = np.random.beta(a=aDistParams[i[0]][0], b=aDistParams[i[0]][1])
counter += 1
data = pd.DataFrame(npData)
data = data.reindex(np.random.permutation(data.index)) # shuffle rows because my original data doesn't give any guarantees
Because the parameters ['A', 'B', ..., 'F'] are generated as a cartesian-product (meaning: nested for-loops; a priori), i want to use groupby for obtaining each possible 'simulation scenario' before analysing the output.
The parameter 'F' describe multiple runs for each scenario (each scenario defined by 'A', 'B', ..., 'E' ; let's assume, that 'F' is the random-seed), so my code becomes:
grouped = data.groupby(['A','B','C','D','E'])
# -> every group defines one simulation scenario
grouped_agg = grouped.agg(({'G' : np.mean}))
# -> the mean of the simulation output in 'G' over 'F' is calculated for each group/scenario
What do i want to do now?
I: display all the (unique) values of each scenario-parameter within these groups -> grouped_agg gives me an iterable of tuples, where for example all the entries at each position 0 give me all the values for 'A' (so with a few lines of python i would get my unique values, but maybe there is a function for that)
- Update: my approach
list(set(grouped_agg.index.get_level_values('A')))
(when interested in 'A'; using set for obtaining unique values; probably not the stuff you want to do, if you need high performance)
- =>
[0, 1, 2, 3, 6]
II: generate some plots (of lower dimension) -> i need to make some variables constant and filter/select my data before plotting (therefore step I needed) =>
- 'B' const
- 'C', const
- 'E' const
- 'D' = x-axis
- 'G' = y-axis / output from my aggregation
- 'A' = one more dimension = multiple colors within 2d-plot -> one G/y-axis for each value of 'A'
How would i generate a plot like that?
I think, that reshaping my data is the key step and pandas plotting capabilities will handle it then. Maybe achieving a shape, where there are 5 columns (one for each value of parameter A) and the corresponding G-values for each index-selection + param-A-selection is enough, but i wasn't able to achieve that form yet.
Thanks for your input!
(i'm using pandas 0.12 within enthought canopy)
Sascha