Question

I'm giving myself a crash course in using python and pandas for data crunching. I finally got sick of using spreadsheets and wanted something more flexible than R so I decided to give this a spin. It's a really slick interface and I'm having a blast playing around with it. However, in researching different tricks, I've been unable to find just a cheat sheet of basic spreadsheet functions, particularly with regard to adding formulas to new columns in dataframes that reference other columns.

I was wondering if someone might give me the recommended code to accomplish the 6 standard spreadsheet operations below, just so I can get a better idea of how it works. If you'd like to see a full size rendering of the image just click here

Pandas spreadsheet example

If you'd like to see the spreadsheet for yourself, click here.

I'm already somewhat familiar with adding columns to dataframes, it's mainly the cross-referencing of specific cells that I'm struggling with. Basically, I'm anticipating the answer loosely looking something like:

table['NewColumn']=(table['given_column']+magic-code-that-I-don't-know).astype(float-or-int-or-whatever)

If I would do well to use an additional library to accomplish any of these functions, feel free to suggest it.


Solution

In general, you want to be thinking about vectorized operations on columns instead of operations on specific cells.

So, for example, if you had a data column, and you wanted another column that was the same but with each value multiplied by 3, you could do this in two basic ways. The first is the "cell-by-cell" operation.

df['data_prime'] = df['data'].apply(lambda x: 3*x)

The second is the vectorized way:

df['data_prime'] = df['data'] * 3
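To see that the two forms agree, here is a minimal sketch on made-up sample data (the values are hypothetical; the column name `data` follows the examples above). The vectorized form is usually the faster and more readable of the two.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the answer.
df = pd.DataFrame({"data": [1, 2, 3, 4]})

# Cell-by-cell: the lambda is called once for every value in the column.
by_cell = df["data"].apply(lambda x: 3 * x)

# Vectorized: the multiplication is applied to the whole column at once.
vectorized = df["data"] * 3

# Both approaches produce the same values.
same = by_cell.equals(vectorized)
```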

So, column-by-column in your spreadsheet:

Count (you can add 1 to the right side if you want it to start at 1 instead of 0):

df['count'] = pandas.Series(range(len(df)), index=df.index)
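An alternative sketch, assuming NumPy is available: `numpy.arange` builds the counter as a plain array, which is assigned by position and so works even when the DataFrame's index is not 0, 1, 2, and so on. The sample data and index labels here are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default (string) index.
df = pd.DataFrame({"data": [10, 20, 30]}, index=["a", "b", "c"])

# A plain array is assigned by position, so the index labels do not matter.
# This counter starts at 1; use np.arange(len(df)) to start at 0 instead.
df["count"] = np.arange(1, len(df) + 1)
```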

Running total:

df['running total'] = df['data'].cumsum()
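As a quick check of what `cumsum` produces, a small sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"data": [5, 3, 8, 2]})

# Each row holds the sum of 'data' from the first row through the current row.
df["running total"] = df["data"].cumsum()
# running total is now 5, 8, 16, 18
```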

Difference from a scalar (set the scalar to a particular value in your df if you want):

df['diff'] = scalar - df['data']
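If the scalar should come from a specific cell, as with an absolute reference like `$B$2` in a spreadsheet, one way is to pull it out with `.iloc` (by position) or `.loc` (by label) first. A sketch with hypothetical values, taking the first row of `data` as the reference cell:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"data": [5, 3, 8, 2]})

# Take the scalar from one specific cell: the first row of 'data', by position.
scalar = df["data"].iloc[0]

# The scalar is broadcast against every row of the column.
df["diff"] = scalar - df["data"]
# diff is now 0, 2, -3, 3
```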

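Moving average (strictly, a cumulative average of all rows so far; the + 1 is needed because count starts at 0 above, which would otherwise divide by zero in the first row, so drop it if your count already starts at 1):

df['moving average'] = df['running total'] / (df['count'] + 1).astype('float')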
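pandas also has built-ins for both kinds of average, so the helper columns are optional. This is a sketch on hypothetical values; the window size of 3 is an assumption, since the spreadsheet's actual window is not shown here.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"data": [2.0, 4.0, 6.0, 8.0]})

# Cumulative (expanding) mean: average of all rows up to and including this one.
df["cumulative average"] = df["data"].expanding().mean()

# Windowed mean over the last 3 rows (window size is an assumption).
# The first 2 rows are NaN because the window is not yet full.
df["moving average"] = df["data"].rolling(window=3).mean()
```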

Basic formula from your spreadsheet:

I think you have enough to do this on your own.
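For reference, a sketch of the general pattern. The spreadsheet's actual formula is not reproduced here, so the columns `a` and `b` and the formula itself are hypothetical. Arithmetic between columns is applied row by row, and `shift` covers formulas that refer to the previous row.

```python
import pandas as pd

# Hypothetical columns standing in for spreadsheet columns A and B.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Equivalent of a formula like =A2*2+B2 filled down the column.
df["formula"] = df["a"] * 2 + df["b"]

# Equivalent of =B3-B2: shift(1) lines each row up with the row above it.
# The first row has no previous row, so its result is NaN.
df["change in b"] = df["b"] - df["b"].shift(1)
```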

If statement:

df['new column'] = 0
mask = df['data column'] >= 3
df.loc[mask, 'new column'] = 1
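A more compact alternative, assuming NumPy is available, is `numpy.where`, which reads much like a spreadsheet `IF(condition, value_if_true, value_if_false)`. The sample values are hypothetical; the column names follow the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"data column": [1, 3, 5, 2]})

# 1 where the condition holds, 0 everywhere else.
df["new column"] = np.where(df["data column"] >= 3, 1, 0)
# new column is now 0, 1, 1, 0
```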
Licensed under: CC-BY-SA with attribution