Question

I have a pandas dataframe with date range values as strings of the form '2014-10-16 - 2014-10-23' in one column and I would like to keep this column, but add new columns for the start and end year, month, and day (e.g. StartYear, EndDay, etc.).

Is there a compact way to do this using Python, ideally taking advantage of pandas time series features and working within the dataframe?


Solution

You can use the .str.extract method. Starting with:

>>> df
                      date
0  2014-01-24 - 2014-08-23
1  2012-03-12 - 2013-04-03
2  2014-10-16 - 2014-10-23

[3 rows x 1 columns]

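The extraction can be done with one capture group per date component:

>>> import numpy as np
>>> import pandas as pd
>>> # two-level column labels: (start|end, year|mon|day)
>>> cols = pd.MultiIndex.from_tuples([(x, y) for x in ['start', 'end'] for y in ['year', 'mon', 'day']])
>>> pat = r'(\d+)-(\d+)-(\d+) - (\d+)-(\d+)-(\d+)'
>>> xdf = pd.DataFrame(df.date.str.extract(pat).values, columns=cols).astype(np.int64)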
>>> xdf
   start             end          
    year  mon  day  year  mon  day
0   2014    1   24  2014    8   23
1   2012    3   12  2013    4    3
2   2014   10   16  2014   10   23

[3 rows x 6 columns]

To concatenate the result with the original DataFrame:

>>> pd.concat([df, xdf], axis=1)
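
The question asks for flat column names such as StartYear and EndDay, whereas the concatenation above mixes a plain column with two-level labels. A minimal sketch of one way to get flat names, assuming the column is called date as in the example; the names list and the Month spelling are illustrative choices, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Sample data copied from the example above.
df = pd.DataFrame({"date": ["2014-01-24 - 2014-08-23",
                            "2012-03-12 - 2013-04-03",
                            "2014-10-16 - 2014-10-23"]})

# One capture group per date component, start date first.
pat = r"(\d+)-(\d+)-(\d+) - (\d+)-(\d+)-(\d+)"

# Flat labels in the same order as the capture groups (illustrative names).
names = [side + part
         for side in ("Start", "End")
         for part in ("Year", "Month", "Day")]

# extract() returns strings, so convert to integers before renaming.
parts = df["date"].str.extract(pat, expand=True).astype(np.int64)
parts.columns = names

# Keep the original column and append the six new ones.
result = pd.concat([df, parts], axis=1)
```

Rows that do not match the pattern would come back as NaN and make the integer conversion fail, so this sketch assumes every row is well formed.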

Edit: .str.findall seems to fit better, since it collects every run of digits without spelling out the full pattern:

>>> pd.DataFrame(df.date.str.findall(r'\d+').tolist(), columns=cols).astype(np.int64)
   start             end          
    year  mon  day  year  mon  day
0   2014    1   24  2014    8   23
1   2012    3   12  2013    4    3
2   2014   10   16  2014   10   23

[3 rows x 6 columns]
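
Since the question also asks about pandas time series features, here is a hedged alternative sketch that parses both ends of the range with pd.to_datetime and reads the components through the .dt accessor. It assumes the separator is always ' - ' and the dates are in YYYY-MM-DD form; the names bounds, start, end and out are illustrative.

```python
import pandas as pd

# Sample data copied from the example above.
df = pd.DataFrame({"date": ["2014-01-24 - 2014-08-23",
                            "2012-03-12 - 2013-04-03",
                            "2014-10-16 - 2014-10-23"]})

# Split each range into its two date strings (columns 0 and 1).
bounds = df["date"].str.split(" - ", expand=True)

# Parse the strings into real datetimes; the format is assumed fixed.
start = pd.to_datetime(bounds[0], format="%Y-%m-%d")
end = pd.to_datetime(bounds[1], format="%Y-%m-%d")

# assign() returns a copy with the original column kept intact.
out = df.assign(
    StartYear=start.dt.year, StartMonth=start.dt.month, StartDay=start.dt.day,
    EndYear=end.dt.year, EndMonth=end.dt.month, EndDay=end.dt.day,
)
```

Keeping start and end as datetime columns as well would allow further date arithmetic, such as end - start for the length of each range.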
Licensed under: CC-BY-SA with attribution