Question

I am working with a large data set, so I will recreate similar conditions below:

Let's say we are using this data set:

import pandas as pd

df=pd.DataFrame({'Location': [ 'NY', 'SF', 'NY', 'NY', 'SF', 'SF', 'TX', 'TX', 'TX', 'DC'],
                 'Class': ['H','L','H','L','L','H', 'H','L','L','M'],
                 'Address': ['12 Silver','10 Fak','12 Silver','1 North','10 Fak','2 Fake', '1 Red','1 Dog','2 Fake','1 White'],
                 'Score':['4','5','3','2','1','5','4','3','2','1',]})

So I want the rows to be the unique values in df.Location.

The first column will be the number of data entries for each location. I can get this for each location separately by:

df[df['Location'] =='SF'].count()['Location']
df[df['Location'] =='NY'].count()['Location']
df[df['Location'] =='TX'].count()['Location']
df[df['Location'] =='DC'].count()['Location']
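
For example, I believe value_counts would give all four of these counts at once, though I have not tried it on the full data set:

# number of rows per location, in one call
df['Location'].value_counts()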

For the second, third and fourth columns I want to count the different types in Class (H, L, M). I know I can do this by:

#Second Col for NY
print (df[(df.Location =='NY') & (df.Class=='H')].count()['Class'])
#Third Col for NY
print (df[(df.Location =='NY') & (df.Class=='L')].count()['Class'])
#Fourth Col for NY
print (df[(df.Location =='NY') & (df.Class=='M')].count()['Class'])

I am guessing this would work with a pivot table, but when I tried it on the DataFrame everything got mixed up.
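
For example, I think pd.crosstab would produce these three columns in one call, though I have not verified it on the full data set:

# counts of each Class value per Location
pd.crosstab(df.Location, df.Class)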

For the fifth column I want to count the number of unique values in Address for each location. For example, in NY the value should be 2, since there are two unique addresses ('12 Silver' appears twice):

print (df[(df.Location =='NY')].Address)
>>> 
0    12 Silver
2    12 Silver
3      1 North
Name: Address, dtype: object

I guess this can be done with groupby, but I always get confused when using it. I could also use .drop_duplicates and then count to get a numerical value.
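
For example, I believe either of these would give the per-location count of unique addresses, though I have not checked them on the full data set:

# unique addresses per location, via groupby
df.groupby('Location').Address.nunique()

# or drop the duplicates first and count what is left
df.drop_duplicates(['Location', 'Address']).groupby('Location').size()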

The sixth column should be the number of Score values less than 4. So the value for NY should be:

print (df[(df.Location =='NY') & (df.Score.astype(float) < 4)].count()['Score'])
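
For example, I think this would give that count for every location at once, though again it is only a guess:

# number of scores below 4 for each location
df.groupby('Location').Score.apply(lambda s: (s.astype(float) < 4).sum())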

So what is a good way to make a DataFrame like this, where the rows are the unique locations and the columns are as described above?

It should look something like:

    Pop  H  L  M  HH  L4
DC    1  0  0  1   1   1
NY    3  2  1  0   2   2
SF    3  1  2  0   2   1
TX    3  1  2  0   3   2

Since I know more or less how to get each individual component, I could use a for loop over the unique locations, but there should be an easier way of doing this.

Solution

While with enough stacking tricks you might be able to do this all in one go, I don't think it'd be worth it. You have a pivot operation and a bunch of groupby operations. So do them separately -- which is easy -- and then combine the results.

Step #1 is to make Score a float column; it's better to get the types right before you start processing.

>>> df["Score"] = df["Score"].astype(float)

Then we'll make a new frame with the groupby-like columns. We could do this by passing .agg a dictionary but we'd have to rename the columns afterwards anyway, so there's not much point.

>>> gg = df.groupby("Location")
>>> summ = pd.DataFrame({"Pop": gg.Location.count(),
...                      "HH": gg.Address.nunique(),
...                      "L4": gg.Score.apply(lambda x: (x < 4).sum())})
>>> summ
          HH  L4  Pop
Location             
DC         1   1    1
NY         2   2    3
SF         2   1    3
TX         3   2    3

[4 rows x 3 columns]
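
On newer pandas (0.25+), named aggregation builds the same frame in a single agg call; this is a sketch, not what the answer above ran:

>>> summ = df.groupby("Location").agg(
...     Pop=("Location", "count"),
...     HH=("Address", "nunique"),
...     L4=("Score", lambda s: (s < 4).sum()),
... )

The column order then follows the keyword order rather than being sorted alphabetically.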

Then we can pivot:

>>> class_info = df.pivot_table(index="Location", columns="Class", aggfunc='size', fill_value=0)
>>> class_info
Class     H  L  M
Location         
DC        0  0  1
NY        2  1  0
SF        1  2  0
TX        1  2  0

[4 rows x 3 columns]
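
For reference, a plain groupby should give the same table; this is a sketch that is equivalent on this data:

>>> # count rows per (Location, Class) pair, then spread Class into columns
>>> class_info = df.groupby(["Location", "Class"]).size().unstack(fill_value=0)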

and combine:

>>> new_df = pd.concat([summ, class_info], axis=1)
>>> new_df
          HH  L4  Pop  H  L  M
Location                      
DC         1   1    1  0  0  1
NY         2   2    3  2  1  0
SF         2   1    3  1  2  0
TX         3   2    3  1  2  0

[4 rows x 6 columns]

You can reorder this as you like.
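
For example, to match the column order shown in the question (assuming you keep these column names):

>>> new_df = new_df[["Pop", "H", "L", "M", "HH", "L4"]]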

Licensed under: CC-BY-SA with attribution