Question

I have a JSON file (logs.json) that was sent to me with the following data in it (I am showing only some of it, as there are over 2,000 entries):

["2012-03-01T00:05:55+00:00", "2012-03-01T00:06:23+00:00", "2012-03-01T00:06:52+00:00", "2012-03-01T00:11:23+00:00", "2012-03-01T00:12:47+00:00", "2012-03-01T00:12:54+00:00", "2012-03-01T00:16:14+00:00", "2012-03-01T00:17:31+00:00", "2012-03-01T00:21:23+00:00", "2012-03-01T00:21:26+00:00", "2012-03-01T00:22:25+00:00", "2012-03-01T00:28:24+00:00", "2012-03-01T00:31:21+00:00", "2012-03-01T00:32:20+00:00", "2012-03-01T00:33:32+00:00", "2012-03-01T00:35:21+00:00", "2012-03-01T00:38:14+00:00", "2012-03-01T00:39:24+00:00", "2012-03-01T00:43:12+00:00", "2012-03-01T00:46:13+00:00", "2012-03-01T00:46:31+00:00", "2012-03-01T00:48:03+00:00", "2012-03-01T00:49:34+00:00", "2012-03-01T00:49:54+00:00", "2012-03-01T00:55:19+00:00", "2012-03-01T00:56:27+00:00", "2012-03-01T00:56:32+00:00"]

Using Pandas, I did:

import pandas as pd
logs = pd.read_json('logs.json')
logs.head()

And I get the following:

                           0
0  2012-03-01T00:05:55+00:00
1  2012-03-01T00:06:23+00:00
2  2012-03-01T00:06:52+00:00
3  2012-03-01T00:11:23+00:00
4  2012-03-01T00:12:47+00:00

[5 rows x 1 columns]

Then, in order to assign the proper data type including the UTC zone, I do:

logs = pd.to_datetime(logs[0], utc=True)
logs.head()

And get:

0   2012-03-01 00:05:55
1   2012-03-01 00:06:23
2   2012-03-01 00:06:52
3   2012-03-01 00:11:23
4   2012-03-01 00:12:47
Name: 0, dtype: datetime64[ns]

Here are my questions:

  1. Is the above code correct to get my data in the right format?
  2. Where did my UTC zone go? And what if I want to create a column with the corresponding PST time and add it to this dataset as a DataFrame column?
  3. I seem to recall that in order to obtain counts per day, week, or year, I need to use .day, .week, or .year somewhere (logs.day?), but I cannot figure it out, and I am guessing that it is because of the current shape of my data. How do I get counts by day, week, and year so that I can plot the data? And how would I go about plotting it?

Such simple questions, yet they seem so hard for someone who is transitioning from R to Python for data analysis! I hope you can help!


Solution

I think there may be a bug in the tz handling here; it's certainly possible that the timezone should be converted by default (I was surprised that it wasn't; I suspect it's because the JSON is just a list).

In [21]: s = pd.read_json(js, convert_dates=[0], typ='series')  # js is the JSON string; more honestly this is a Series

In [22]: s.head()
Out[22]:
0   2012-03-01 00:05:55
1   2012-03-01 00:06:23
2   2012-03-01 00:06:52
3   2012-03-01 00:11:23
4   2012-03-01 00:12:47
dtype: datetime64[ns]
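As an aside on question 2: with a recent pandas the UTC zone survives if you parse with `utc=True`, and a PST column is then one `tz_convert` call away. A minimal sketch using a couple of the timestamps from the question (`'US/Pacific'` is one valid zone name; the column names are my own):

```python
import pandas as pd

# Two of the timestamps from the question.
raw = ["2012-03-01T00:05:55+00:00", "2012-03-01T00:06:23+00:00"]

# Parse as tz-aware UTC, then convert the same instants to Pacific wall time.
utc = pd.to_datetime(pd.Series(raw), utc=True)   # dtype: datetime64[ns, UTC]
pst = utc.dt.tz_convert('US/Pacific')

# Keep both alongside each other in a DataFrame.
df = pd.DataFrame({'utc': utc, 'pst': pst})
```

Note that the PST column holds the same instants, just rendered in Pacific wall time (2012-03-01 00:05:55 UTC becomes 2012-02-29 16:05:55 PST).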

To get counts by year, month, etc. I would probably use a DatetimeIndex (at the moment date-like columns don't have year/month etc. methods, though I think they could, and perhaps should):

In [23]: dti = pd.DatetimeIndex(s)

In [24]: s.groupby(dti.year).size()
Out[24]:
2012    27
dtype: int64

In [25]: s.groupby(dti.month).size()
Out[25]:
3    27
dtype: int64
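The same groupby pattern extends to counts per calendar day or per ISO week (the rest of question 3). A sketch with a small hand-made sample rather than the full log (`dti.date` and `isocalendar()` need a reasonably recent pandas):

```python
import pandas as pd

# Hand-made sample: two entries on March 1, one on March 2.
raw = ["2012-03-01T00:05:55+00:00", "2012-03-01T00:06:23+00:00",
       "2012-03-02T10:00:00+00:00"]
s = pd.to_datetime(pd.Series(raw))
dti = pd.DatetimeIndex(s)

# Counts per calendar day and per ISO week.
per_day = s.groupby(dti.date).size()
per_week = s.groupby(dti.isocalendar().week.values).size()
```

Here `per_day` has one row per distinct date (2 and 1 events), and `per_week` collapses both dates into ISO week 9.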

Perhaps it makes more sense to view the data as a TimeSeries:

In [31]: ts = pd.Series(1, dti)

In [32]: ts.head()
Out[32]:
2012-03-01 00:05:55    1
2012-03-01 00:06:23    1
2012-03-01 00:06:52    1
2012-03-01 00:11:23    1
2012-03-01 00:12:47    1
dtype: int64

This way you can use resample:

In [33]: ts.resample('M').sum()  # in older pandas: ts.resample('M', how='sum')
Out[33]:
2012-03-31    27
Freq: M, dtype: int64
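Resampling also gives daily and weekly counts directly, and the resulting series plots with a single call. A sketch on the same hand-made sample as above (`.plot()` assumes matplotlib is installed; the older `how='sum'` spelling was removed in later pandas in favour of chained `.sum()`):

```python
import pandas as pd

# Hand-made sample: two events on March 1, one on March 2.
dti = pd.DatetimeIndex(["2012-03-01T00:05:55+00:00",
                        "2012-03-01T00:06:23+00:00",
                        "2012-03-02T10:00:00+00:00"])
ts = pd.Series(1, index=dti)

daily = ts.resample('D').sum()    # events per day
weekly = ts.resample('W').sum()   # events per week

# daily.plot(kind='bar')          # uncomment to plot (requires matplotlib)
```

Swapping `'D'` for `'W'`, `'M'`, or `'A'` gives daily, weekly, monthly, or annual counts with no other changes.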
Licensed under: CC-BY-SA with attribution