Question

I'm using pandas and I'm trying to find the difference in years when my data is grouped by labels and then by teams. I've tried using a groupby, but I can't quite get my desired result. Here is the head(8) of my df:

Team            Year    labels
Hawks           2001      b
Hawks           2004      b
Nets            1987      b
Nets            1988      a
Nets            2004      b
Nets            2001      a
Nets            2000      c
Hawks           2003      a

What confuses me is that there are basically two groups that I want: labels and team. Within each group I then need to sort the years and find the difference between consecutive years, with the result going in a difference column. Any help would be greatly appreciated. This is the output I'm after:

Team            Year    labels  difference
Hawks           2001      b       NAN
Hawks           2004      b        1
Nets            1987      b       NAN
Nets            1988      a       NAN
Nets            2004      b       17
Nets            2001      a       13
Nets            2000      c       NAN
Hawks           2003      b        2

Solution

I'm not sure whether the label for the last row is supposed to be 'a' or 'b'. From your data snippet:

Hawks           2003      a

From your expected output:

Hawks           2003      b        2

I'll assume the label is supposed to be 'b' so that I can match your expected output.

You want to do a groupby on ['Team','labels'], which you can use to compute the year difference. But first sort your data by ['Team','labels','Year'] so that the years within each group are in ascending order and the differences come out correct:

In [8]: df.sort_values(['Team','labels','Year'],inplace=True)
In [9]: df
Out[9]: 
    Team  Year labels
0  Hawks  2001      b
7  Hawks  2003      b
1  Hawks  2004      b
3   Nets  1988      a
5   Nets  2001      a
2   Nets  1987      b
4   Nets  2004      b
6   Nets  2000      c

Now, do a groupby on ['Team','labels'] and compute the difference between years for each row in the group:

In [10]: df['difference'] = df.groupby(['Team','labels'])['Year'].diff(1)
In [11]: df
Out[11]: 
    Team  Year labels  difference
0  Hawks  2001      b         NaN
7  Hawks  2003      b           2
1  Hawks  2004      b           1
3   Nets  1988      a         NaN
5   Nets  2001      a          13
2   Nets  1987      b         NaN
4   Nets  2004      b          17
6   Nets  2000      c         NaN
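
For anyone who wants to reproduce the two steps above outside an interactive session, here is a minimal self-contained sketch. It rebuilds the frame from the question's snippet, assuming (as above) that the last row's label is 'b', and it uses `sort_values`, since `DataFrame.sort` is no longer available in recent pandas versions. The name `df_sorted` is just an illustrative choice.

```python
import pandas as pd

# Data from the question; the last label is ASSUMED to be 'b'
# so that the result matches the expected output.
df = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Nets", "Hawks"],
    "Year": [2001, 2004, 1987, 1988, 2004, 2001, 2000, 2003],
    "labels": ["b", "b", "b", "a", "b", "a", "c", "b"],
})

# Sort so that years are ascending inside each (Team, labels) group.
df_sorted = df.sort_values(["Team", "labels", "Year"])

# diff() subtracts the previous row's Year within the same group;
# the first row of every group has no predecessor and gets NaN.
df_sorted["difference"] = df_sorted.groupby(["Team", "labels"])["Year"].diff()

print(df_sorted)
```

The difference column comes back as floats, because NaN cannot be stored in a plain integer column.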

And if for some reason you want to go back to the original order of the dataframe, sort by the index (this returns a new dataframe rather than modifying df):

In [12]: df.sort_index()
Out[12]: 
    Team  Year labels  difference
0  Hawks  2001      b         NaN
1  Hawks  2004      b           1
2   Nets  1987      b         NaN
3   Nets  1988      a         NaN
4   Nets  2004      b          17
5   Nets  2001      a          13
6   Nets  2000      c         NaN
7  Hawks  2003      b           2
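
If the original row order matters from the start, one possible variation is to skip the in-place sort entirely. The sketch below computes the differences on a sorted copy and relies on pandas aligning the result back to `df` by index on assignment. It again assumes the last label is 'b'. Sorting by Year alone should be enough here, because groupby keeps the existing row order within each group.

```python
import pandas as pd

# Data from the question; the last label is ASSUMED to be 'b'.
df = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Nets", "Hawks"],
    "Year": [2001, 2004, 1987, 1988, 2004, 2001, 2000, 2003],
    "labels": ["b", "b", "b", "a", "b", "a", "c", "b"],
})

# The right-hand side is a Series indexed by the original row labels,
# so assigning it to a column aligns on index and leaves df unsorted.
df["difference"] = (
    df.sort_values("Year")
      .groupby(["Team", "labels"])["Year"]
      .diff()
)

print(df)
```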
Licensed under: CC-BY-SA with attribution