Question

This is an extension of an earlier question.

I want to use pandas to iterate through my .csv file and group the data by season (and year), calculating the mean for each season in each year. Currently the quarterly script uses calendar quarters (Jan-Mar, Apr-Jun, etc.). I want the seasons to map to months as follows:

11: 'Winter', 12: 'Winter', 1: 'Winter', 2: 'Spring', 3: 'Spring', 4: 'Spring', 5: 'Summer', 6: 'Summer', 7: 'Summer', 8: 'Autumn', 9: 'Autumn', 10: 'Autumn'

I have the following data:

Date,HAD
01/01/1951,1
02/01/1951,-0.13161201
03/01/1951,-0.271796132
04/01/1951,-0.258977158
05/01/1951,-0.198823057
06/01/1951,0.167794502
07/01/1951,0.046093808
08/01/1951,-0.122396694
09/01/1951,-0.121824587
10/01/1951,-0.013002463

...

all the way up to

20/12/2098,62.817
21/12/2098,59.998
22/12/2098,50.871
23/12/2098,88.405
24/12/2098,81.154
25/12/2098,83.617
26/12/2098,120.675
27/12/2098,273.795
28/12/2098,316.324
29/12/2098,260.951
30/12/2098,198.505
31/12/2098,150.755

This is the code from the earlier question, which works:

import pandas as pd
import os
import re

lookup = {
    11: 'Winter',
    12: 'Winter',
    1: 'Winter',
    2: 'Spring',
    3: 'Spring',
    4: 'Spring',
    5: 'Summer',
    6: 'Summer',
    7: 'Summer',
    8: 'Autumn',
    9: 'Autumn',
    10: 'Autumn'
}

os.chdir('C:/Users/n-jones/testdir/output/')

for fname in os.listdir('.'):
    if re.match(".*csv$", fname):
        data = pd.read_csv(fname, parse_dates=[0], dayfirst=True)
        data['Season'] = data['Date'].apply(lambda x: lookup[x.month])
        data['count'] = 1
        data = data.groupby(['Season'])[['HAD', 'count']].sum()
        data['mean'] = data['HAD'] / data['count']
        data.to_csv('C:/Users/n-jones/testdir/season/' + fname)

I want my output csv file to be:

Autumn 1951, Mean, Winter 1951/52, Mean, Spring 1952, Mean, Summer 1952, Mean,
Autumn 1952, Mean, Winter 1952/53, Mean, Spring 1953, Mean, Summer 1953, Mean,

and so on...

I hope this makes some sense.

Thank you in advance!

Solution

This is a perfect example of a situation where itertools.groupby is your best friend!

Please forgive me for not expanding on your code, but I'm not too familiar with pandas, so I opted to use the csv module.

With two functions for grouping the data (get_season and get_year), it's only a matter of iterating over the groups and writing the data to a new csv file.

import csv
from datetime import datetime
from itertools import groupby

LOOKUP_SEASON = {
    11: 'Winter',
    12: 'Winter',
    1: 'Winter',
    2: 'Spring',
    3: 'Spring',
    4: 'Spring',
    5: 'Summer',
    6: 'Summer',
    7: 'Summer',
    8: 'Autumn',
    9: 'Autumn',
    10: 'Autumn'
}


def get_season(row):
    date = datetime.strptime(row[0], '%d/%m/%Y')
    season = LOOKUP_SEASON[date.month]
    if season == 'Winter':
        if date.month == 1:
            last_year, next_year = date.year - 1, date.year
        else:
            last_year, next_year = date.year, date.year + 1
        return '{} {}/{}'.format(season, last_year, next_year)
    else:
        return '{} {}'.format(season, date.year)


def get_year(row):
    date = datetime.strptime(row[0], '%d/%m/%Y')
    if date.month < 8:
        return date.year - 1
    else:
        return date.year


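# Python 3: open the output in text mode with newline='' for the csv module
with open('NJDATA.csv') as data_file, open('outfile.csv', 'w', newline='') as out_file:
    next(data_file)  # skip the header line
    reader = csv.reader(data_file)
    writer = csv.writer(out_file)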

    # Loop over groups distinguished by the "year" from Autumn->Summer,
    # defined by the `get_year` function
    for year, seasons in groupby(reader, get_year):
        mean_data = []
        # Loop over the data in the current year, grouped by season, defined
        # by the get_season method. Since the required "season string"
        # (e.g Autumn 1952) can be used as an identifier for the seasons,
        # the `get_season` method returns the specific string which is used
        # in the output, so you don't have to compile that one more time
        # inside the for loops
        for season_str, iter_data in groupby(seasons, get_season):
            data = list(iter_data)
            mean = sum([float(row[1]) for row in data]) / len(data)
            # Use the next line instead if you want to control the precision
            #mean = '{:.3f}'.format(sum([float(row[1]) for row in data]) / len(data))
            mean_data.extend([season_str, mean])
        writer.writerow(mean_data)

The basic idea here is to first group your data by year (running from Autumn to Summer), and then group each year's data again by season. The groupby function takes two arguments: an iterable and a key function. It iterates over the iterable, and whenever the value returned by the key function changes, it starts a new group. Because only consecutive rows are compared, the rows must already be in chronological order, as they are in your file.
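
As a minimal sketch of that key-change behaviour, here is groupby applied to a short list of month numbers. The list and the season_of helper are made up for illustration and are not part of the script above; the mapping is the same as LOOKUP_SEASON.

```python
from itertools import groupby

# Hypothetical month numbers in file order (not taken from the real data).
months = [1, 1, 4, 4, 6, 8, 8, 1]


def season_of(month):
    # Same month-to-season mapping as LOOKUP_SEASON, written compactly.
    if month in (11, 12, 1):
        return 'Winter'
    if month in (2, 3, 4):
        return 'Spring'
    if month in (5, 6, 7):
        return 'Summer'
    return 'Autumn'


# groupby starts a new group every time the key changes between
# consecutive items; it does not gather equal keys from across the input.
groups = [(key, list(items)) for key, items in groupby(months, season_of)]

for key, items in groups:
    print(key, items)
```

Note that the trailing 1 ends up in a second 'Winter' group instead of joining the first one, which is exactly why two consecutive winters in the file stay separate.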

Consider this sample data:

01/01/1951,1
02/01/1951,-0.13161201
01/04/1951,1
02/04/1951,-0.13161201
03/04/1951,-0.271796132
04/06/1951,-0.258977158
05/06/1951,-0.198823057
06/08/1951,0.167794502
...
09/02/1952,-0.121824587

The first groupby call groups the data based on your year-definition (defined in get_year), giving the following groups of data:

# get_year returns 1950
01/01/1951,1
...
05/06/1951,-0.198823057

# get_year returns 1951 
06/08/1951,0.167794502
...
09/02/1952,-0.121824587

The next groupby call groups each of the above groups by season (defined in get_season). Let's consider the first group:

# get_season returns 'Winter 1950/1951'
01/01/1951,1
02/01/1951,-0.13161201

# get_season returns 'Spring 1951'
01/04/1951,1
02/04/1951,-0.13161201
03/04/1951,-0.271796132

# get_season returns 'Summer 1951'
04/06/1951,-0.258977158
05/06/1951,-0.198823057
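
To check the walkthrough, the row that would be written for this first group can be computed directly from the seven sample rows. This is only a sketch: season_key is a compact stand-in for get_season that I wrote for this example, and the rows are hard-coded instead of being read from a file.

```python
from datetime import datetime
from itertools import groupby

# The rows of the first group from the walkthrough (get_year returns 1950).
rows = [
    ['01/01/1951', '1'],
    ['02/01/1951', '-0.13161201'],
    ['01/04/1951', '1'],
    ['02/04/1951', '-0.13161201'],
    ['03/04/1951', '-0.271796132'],
    ['04/06/1951', '-0.258977158'],
    ['05/06/1951', '-0.198823057'],
]


def season_key(row):
    # Compact stand-in for get_season, producing the same label strings.
    date = datetime.strptime(row[0], '%d/%m/%Y')
    month, year = date.month, date.year
    if month in (11, 12):
        return 'Winter {}/{}'.format(year, year + 1)
    if month == 1:
        return 'Winter {}/{}'.format(year - 1, year)
    name = 'Spring' if month <= 4 else 'Summer' if month <= 7 else 'Autumn'
    return '{} {}'.format(name, year)


# Build the flat [label, mean, label, mean, ...] list for one output row.
mean_data = []
for label, group in groupby(rows, season_key):
    values = [float(row[1]) for row in group]
    mean_data.extend([label, sum(values) / len(values)])

print(mean_data)
```

The list alternates season labels and means, in the same layout as one line of the requested output file.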

Other suggestions

Here is a simple solution:

import pandas as pd

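def year_and_season(x):
    # lookup is the month -> season dict from the question
    season = lookup[x.month]
    year = x.year
    # November and December belong to the winter that ends the following year
    if x.month in (11, 12):
        year += 1
    return (year, season)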

data = pd.read_csv('example.csv', index_col=0, parse_dates=[0], dayfirst=True)
yearsAndSeason = data.groupby(year_and_season).mean()
yearsAndSeason.to_csv('results.csv')

Note that the index column was set to the date when reading the file, so pandas passes each date to the grouping function and we can access its fields directly. The function returns a tuple with both year and season. You can also call mean directly, instead of dividing a sum by a count.

The results.csv does not look exactly like what you expect, because the keys are printed as tuples, but you can probably work that part out. Here is how it looks for me:

$ cat results.csv
,Mean
"(1951, 'Winter')",0.009545620900000005
"(2099, 'Winter')",145.65558333333334

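One possible way to turn those tuple keys into the labels asked for in the question is a small formatting helper. This is a sketch, not part of the answer above: format_season is a name I made up, and it assumes a winter is filed under the year of its January, as year_and_season does.

```python
def format_season(key):
    """Turn a (year, season) tuple into a label such as 'Winter 1951/52'."""
    year, season = key
    if season == 'Winter':
        # Assumption: the winter is keyed by the year of its January,
        # so it started in the previous calendar year.
        return '{} {}/{:02d}'.format(season, year - 1, year % 100)
    return '{} {}'.format(season, year)


# The two keys from the results above, plus one non-winter key.
keys = [(1951, 'Winter'), (1951, 'Autumn'), (2099, 'Winter')]
labels = [format_season(key) for key in keys]
print(labels)
```

Assuming the grouped index yields the tuples when iterated, something like `yearsAndSeason.index = [format_season(key) for key in yearsAndSeason.index]` before the to_csv call should give readable row labels, though I have not run that against the original file.
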
I ran into the same kind of issue and found that the resample method can do this with the frequency string 3M (for 3 months).

I discovered it thanks to this website, which gives an example related to the question: http://earthpy.org/time_series_analysis_with_pandas_part_2.html

If you have a DataFrame whose index is a pandas DatetimeIndex, all you need to do is resample on a 3-month basis.

In [108]:
data.head()
Out[108]:
         Sample Measurement
              mean
Date Local  
2006-01-01  50.820833
2006-01-02  41.900000
2006-01-03  45.870833
2006-01-04  50.850000
2006-01-05  37.116667

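In [109]:
# skip the first 88 rows in order to begin the resampling in March
# (.mean() is needed on pandas 0.18 and later, where resample no longer averages by default)
wm = data[88:].resample('3M', closed='left').mean()
wm.head()
Out[109]: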
         Sample Measurement
              mean
Date Local  
2006-05-31  7.153622
2006-08-31  5.883025
2006-11-30  11.619724
2007-02-28  21.105789
2007-05-31  8.105313

This is on my dataset with daily values. I did lose the first three months of data, but I think this is a really easy way to play with seasons.
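
A variation that avoids slicing off the first rows is to use an anchored quarterly frequency. The sketch below uses the pandas offset alias 'QS-NOV', whose quarters start in November, February, May and August, which matches the seasons in the question. The daily series here is made up for illustration; it is not the data shown above.

```python
import pandas as pd

# Hypothetical daily series standing in for the real data.
index = pd.date_range('1951-01-01', '1952-12-31', freq='D')
series = pd.Series(range(len(index)), index=index, dtype=float)

# 'QS-NOV' is a quarter-start frequency with quarters beginning in
# November, February, May and August. Each bin is labelled with the
# first day of its season.
seasonal = series.resample('QS-NOV').mean()

print(seasonal.head())
```

Each row is labelled with the start date of its season, so a winter running from November 1951 to January 1952 appears as 1951-11-01. The first bin only contains January 1951, because the series starts there.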

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow