Group data in csv by season and year using python and pandas

Question 1

This is a perfect example of a situation where itertools.groupby is your best friend!

Please forgive me for not expanding on your answer, but I'm not too familiar with pandas, so I opted to use the csv module.

With two functions for grouping the data (get_season and get_year), it's only a matter of iterating over the groups and writing the results to a new CSV file.

import csv
from datetime import datetime
from itertools import groupby

LOOKUP_SEASON = {
    11: 'Winter',
    12: 'Winter',
    1: 'Winter',
    2: 'Spring',
    3: 'Spring',
    4: 'Spring',
    5: 'Summer',
    6: 'Summer',
    7: 'Summer',
    8: 'Autumn',
    9: 'Autumn',
    10: 'Autumn'
}


def get_season(row):
    date = datetime.strptime(row[0], '%d/%m/%Y')
    season = LOOKUP_SEASON[date.month]
    if season == 'Winter':
        if date.month == 1:
            last_year, next_year = date.year - 1, date.year
        else:
            last_year, next_year = date.year, date.year + 1
        return '{} {}/{}'.format(season, last_year, next_year)
    else:
        return '{} {}'.format(season, date.year)


def get_year(row):
    date = datetime.strptime(row[0], '%d/%m/%Y')
    if date.month < 8:
        return date.year - 1
    else:
        return date.year


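# Python 3: open CSV files in text mode with newline='' (on Python 2, open
# the output file with 'wb' instead)
with open('NJDATA.csv', newline='') as data_file, open('outfile.csv', 'w', newline='') as out_file:
    headers = next(data_file)  # skip the header line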
    reader = csv.reader(data_file)
    writer = csv.writer(out_file)

    # Loop over groups distinguished by the "year" from Autumn->Summer,
    # defined by the `get_year` function
    for year, seasons in groupby(reader, get_year):
        mean_data = []
        # Loop over the data in the current year, grouped by season, as
        # defined by the `get_season` function. Since the required "season
        # string" (e.g. Autumn 1952) also identifies the season uniquely,
        # `get_season` returns exactly the string used in the output, so
        # you don't have to build it a second time inside the for loops
        for season_str, iter_data in groupby(seasons, get_season):
            data = list(iter_data)
            mean = sum([float(row[1]) for row in data]) / len(data)
            # Use the next line instead if you want to control the precision
            #mean = '{:.3f}'.format(sum([float(row[1]) for row in data]) / len(data))
            mean_data.extend([season_str, mean])
        writer.writerow(mean_data)

The basic idea is to first group the data by year (Autumn -> Summer) and then group each year's data again by season. The groupby function takes two arguments: an iterable and a key function. It iterates over the iterable, and whenever the value returned by the key function changes, the preceding rows are closed off as a distinct group. This means the rows must already be in date order, since groupby only merges consecutive rows that share a key.
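
To see why the ordering matters, here is a minimal sketch. The rows and the helper month_of are made up for illustration and are not part of the script above; one row is deliberately out of order.

```python
from datetime import datetime
from itertools import groupby

# Illustrative rows only; the last one is out of date order on purpose
rows = [
    ['01/01/1951', '1'],
    ['02/01/1951', '-0.13161201'],
    ['01/04/1951', '1'],
    ['03/01/1951', '2'],
]


def parse_date(row):
    return datetime.strptime(row[0], '%d/%m/%Y')


def month_of(row):
    # Hypothetical key function: group rows by calendar month
    return parse_date(row).month


# groupby starts a new group every time the key changes, so January
# shows up twice when the rows are not sorted
unsorted_keys = [key for key, _ in groupby(rows, month_of)]
print(unsorted_keys)  # [1, 4, 1]

# Sorting by date first gives one group per month
sorted_rows = sorted(rows, key=parse_date)
sorted_keys = [key for key, _ in groupby(sorted_rows, month_of)]
print(sorted_keys)  # [1, 4]
```

If your CSV is not guaranteed to be sorted by date, sort the rows before passing them to groupby.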

Consider this sample data:

01/01/1951,1
02/01/1951,-0.13161201
01/04/1951,1
02/04/1951,-0.13161201
03/04/1951,-0.271796132
04/06/1951,-0.258977158
05/06/1951,-0.198823057
06/08/1951,0.167794502
...
09/02/1952,-0.121824587

The first groupby call groups the data based on your year-definition (defined in get_year), giving the following groups of data:

# get_year returns 1950
01/01/1951,1
...
05/06/1951,-0.198823057

# get_year returns 1951 
06/08/1951,0.167794502
...
09/02/1952,-0.121824587

The second groupby call splits each of the above groups by season (as defined in get_season). Let's consider the first group:

# get_season returns 'Winter 1950/1951'
01/01/1951,1
02/01/1951,-0.13161201

# get_season returns 'Spring 1951'
01/04/1951,1
02/04/1951,-0.13161201
03/04/1951,-0.271796132

# get_season returns 'Summer 1951'
04/06/1951,-0.258977158
05/06/1951,-0.198823057
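
If you want to check the walk-through without touching any files, the same nested grouping can be run on the sample rows in memory. This is only a sketch: SAMPLE holds the rows listed above up to the elided part, season_label and season_year are compact stand-ins for get_season and get_year, and rounding to 6 decimals is my own choice.

```python
import csv
import io
from datetime import datetime
from itertools import groupby

# The sample rows shown above, up to the "..." (no header line here)
SAMPLE = """01/01/1951,1
02/01/1951,-0.13161201
01/04/1951,1
02/04/1951,-0.13161201
03/04/1951,-0.271796132
04/06/1951,-0.258977158
05/06/1951,-0.198823057
06/08/1951,0.167794502
"""

# Same month -> season mapping as LOOKUP_SEASON, built more compactly
SEASON_BY_MONTH = {}
for name, months in [('Winter', (11, 12, 1)), ('Spring', (2, 3, 4)),
                     ('Summer', (5, 6, 7)), ('Autumn', (8, 9, 10))]:
    for month in months:
        SEASON_BY_MONTH[month] = name


def parse_date(row):
    return datetime.strptime(row[0], '%d/%m/%Y')


def season_label(row):
    # Stand-in for get_season: e.g. 'Spring 1951' or 'Winter 1950/1951'
    date = parse_date(row)
    season = SEASON_BY_MONTH[date.month]
    if season != 'Winter':
        return '{} {}'.format(season, date.year)
    start = date.year - 1 if date.month == 1 else date.year
    return '{} {}/{}'.format(season, start, start + 1)


def season_year(row):
    # Stand-in for get_year: the "year" runs from August to July
    date = parse_date(row)
    return date.year - 1 if date.month < 8 else date.year


output_rows = []
reader = csv.reader(io.StringIO(SAMPLE))
for year, year_rows in groupby(reader, season_year):
    out = []
    for label, season_rows in groupby(year_rows, season_label):
        values = [float(row[1]) for row in season_rows]
        out.extend([label, round(sum(values) / len(values), 6)])
    output_rows.append(out)

for out in output_rows:
    print(out)
```

Each printed list corresponds to one line that the script above would write to outfile.csv.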

Question 2

Here is a simple solution:

import pandas as pd

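# `lookup` is assumed to be defined beforehand: a dict mapping
# month number (1-12) to season name
def year_and_season(x):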
    season = lookup[x.month]
    year = x.year
    if x.month == 12:
        year += 1
    return (year, season)

data = pd.read_csv('example.csv', index_col=0, parse_dates=[0], dayfirst=True)
yearsAndSeason = data.groupby(year_and_season).mean()
yearsAndSeason.to_csv('results.csv')

Note that the index column was set to the date when reading, so its fields can be accessed directly in the function passed to groupby. That function returns a tuple containing both the year and the season. You can call mean directly on the grouped data, instead of sum.

The resulting results.csv does not look exactly as you expect, because each key is written as a tuple, but you can probably work that part out. Here is how it looks for me:

$ cat results.csv
,Mean
"(1951, 'Winter')",0.009545620900000005
"(2099, 'Winter')",145.65558333333334

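If the tuple keys are a problem, one possible workaround is to group on two ordinary key columns instead of a function that returns a tuple. This is only a sketch under assumptions: the lookup dict below (December to February counted as Winter), the inline sample values, and the column names Year and Season are all made up for illustration.

```python
import pandas as pd

# Assumed month -> season mapping; replace with your own definition
lookup = {12: 'Winter', 1: 'Winter', 2: 'Winter',
          3: 'Spring', 4: 'Spring', 5: 'Spring',
          6: 'Summer', 7: 'Summer', 8: 'Summer',
          9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}

# Made-up sample data standing in for example.csv
dates = pd.to_datetime(
    ['01/12/1950', '15/01/1951', '01/04/1951', '02/04/1951'],
    format='%d/%m/%Y',
)
data = pd.DataFrame({'Mean': [1.0, 3.0, 2.0, 4.0]}, index=dates)

# Build the two keys as plain columns; December counts towards the
# following year, as in year_and_season above
keyed = data.assign(
    Year=[d.year + 1 if d.month == 12 else d.year for d in data.index],
    Season=[lookup[d.month] for d in data.index],
)

# as_index=False keeps Year and Season as regular columns in the result
result = keyed.groupby(['Year', 'Season'], as_index=False)['Mean'].mean()
print(result.to_csv(index=False))
```

Written this way, the CSV gets separate Year and Season columns rather than one quoted tuple per row.
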
Question 3

I ran into the same kind of issue and found that the resample method can do this with the rule 3M (for 3 months).

I discovered it thanks to this website, which gives an example related to the question: http://earthpy.org/time_series_analysis_with_pandas_part_2.html.

If you have a DataFrame whose index is a pandas DatetimeIndex, all you need to do is resample on a 3-month basis.

In [108]:
data.head()
Out[108]:
         Sample Measurement
              mean
Date Local  
2006-01-01  50.820833
2006-01-02  41.900000
2006-01-03  45.870833
2006-01-04  50.850000
2006-01-05  37.116667

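In [109]:
# Start at row 88 in order to begin the resampling in March
# (pandas 0.18+ needs the explicit .mean(); older versions averaged by default)
wm = data[88:].resample('3M', closed='left').mean()
wm.head()
Out[109]: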
         Sample Measurement
              mean
Date Local  
2006-05-31  7.153622
2006-08-31  5.883025
2006-11-30  11.619724
2007-02-28  21.105789
2007-05-31  8.105313

This is on my dataset with daily values. I did lose the first three months of data, but I think this is a really easy way to play with seasons.
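
A possible variation that avoids slicing off the start of the data is an anchored quarterly frequency. The sketch below assumes seasons that start on 1 December, 1 March, 1 June and 1 September, and uses a made-up daily series (each value is just the month number) instead of my dataset.

```python
import pandas as pd

# Made-up daily data for 2006: each day's value is its month number
days = pd.date_range('2006-01-01', '2006-12-31', freq='D')
data = pd.DataFrame({'value': [float(d.month) for d in days]}, index=days)

# 'QS-DEC' = quarter-start frequency with quarters beginning in
# December, March, June and September; each bin is labelled with
# its first day
seasonal = data.resample('QS-DEC').mean()
print(seasonal)
```

With this rule the first and last bins only cover part of a winter (January and February 2006, then December 2006), so treat those two averages with care.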