Question

I am working with a large data set, so I will recreate similar conditions below:

Let's say we are using this data set:

import pandas as pd

df=pd.DataFrame({'Location': [ 'NY', 'SF', 'NY', 'NY', 'SF', 'SF', 'TX', 'TX', 'TX', 'DC'],
                 'Class': ['H','L','H','L','L','H', 'H','L','L','M'],
                 'Address': ['12 Silver','10 Fak','12 Silver','1 North','10 Fak','2 Fake', '1 Red','1 Dog','2 Fake','1 White'],
                 'Score':['4','5','3','2','1','5','4','3','2','1',]})

So I want the rows to be the unique values in df.Location.

The first column will be the number of data entries for each location. I can get this for each location separately by:

df[df['Location'] =='SF'].count()['Location']
df[df['Location'] =='NY'].count()['Location']
df[df['Location'] =='TX'].count()['Location']
df[df['Location'] =='DC'].count()['Location']
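
For example, I believe value_counts would give all four of these counts at once, though I have not tried it on the full data set:

# number of rows per location, in one call
df['Location'].value_counts()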

For the second, third and fourth columns I want to count the different types in Class (H, L, M). I know I can do this by:

#Second Col for NY
print (df[(df.Location =='NY') & (df.Class=='H')].count()['Class'])
#Third Col for NY
print (df[(df.Location =='NY') & (df.Class=='L')].count()['Class'])
#Fourth Col for NY
print (df[(df.Location =='NY') & (df.Class=='M')].count()['Class'])

I am guessing this would work with a pivot table, but when I tried it on the DataFrame everything got mixed up.
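
For example, I think pd.crosstab would produce these three columns in one call, though I have not verified it on the full data set:

# counts of each Class value per Location
pd.crosstab(df.Location, df.Class)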

For the fifth column I want to count the number of unique values in Address for each location. For example, in NY the value should be 2, since there are two unique addresses ('12 Silver' appears twice):

print (df[(df.Location =='NY')].Address)
>>> 
0    12 Silver
2    12 Silver
3      1 North
Name: Address, dtype: object

I guess this can be done with groupby, but I always get confused when using it. I could also use .drop_duplicates and then count to get a numerical value.
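
For example, I believe either of these would give the per-location count of unique addresses, though I have not checked them on the full data set:

# unique addresses per location, via groupby
df.groupby('Location').Address.nunique()

# or drop the duplicates first and count what is left
df.drop_duplicates(['Location', 'Address']).groupby('Location').size()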

The sixth column should be the number of Score values less than 4. So the value for NY should be:

print (df[(df.Location =='NY') & (df.Score.astype(float) < 4)].count()['Score'])
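
For example, I think this would give that count for every location at once, though again it is only a guess:

# number of scores below 4 for each location
df.groupby('Location').Score.apply(lambda s: (s.astype(float) < 4).sum())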

So what is a good way to make a DataFrame like this, where the rows are the unique locations and the columns are as described above?

It should look something like:

    Pop  H  L  M  HH  L4
DC    1  0  0  1   1   1
NY    3  2  1  0   2   2
SF    3  1  2  0   2   1
TX    3  1  2  0   3   2

Since I know more or less how to get each individual component, I could use a for loop over the unique locations, but there should be an easier way of doing this.

Solution

While with enough stacking tricks you might be able to do this all in one go, I don't think it'd be worth it. You have a pivot operation and a bunch of groupby operations. So do them separately -- which is easy -- and then combine the results.

Step #1 is to make Score a float column; it's better to get the types right before you start processing.

>>> df["Score"] = df["Score"].astype(float)

Then we'll make a new frame with the groupby-like columns. We could do this by passing .agg a dictionary but we'd have to rename the columns afterwards anyway, so there's not much point.

>>> gg = df.groupby("Location")
>>> summ = pd.DataFrame({"Pop": gg.Location.count(),
...                      "HH": gg.Address.nunique(),
...                      "L4": gg.Score.apply(lambda x: (x < 4).sum())})
>>> summ
          HH  L4  Pop
Location             
DC         1   1    1
NY         2   2    3
SF         2   1    3
TX         3   2    3

[4 rows x 3 columns]
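
On newer pandas (0.25+), named aggregation builds the same frame in a single agg call; this is a sketch, not what the answer above ran:

>>> summ = df.groupby("Location").agg(
...     Pop=("Location", "count"),
...     HH=("Address", "nunique"),
...     L4=("Score", lambda s: (s < 4).sum()),
... )

The column order then follows the keyword order rather than being sorted alphabetically.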

Then we can pivot:

>>> class_info = df.pivot_table(index="Location", columns="Class", aggfunc='size', fill_value=0)
>>> class_info
Class     H  L  M
Location         
DC        0  0  1
NY        2  1  0
SF        1  2  0
TX        1  2  0

[4 rows x 3 columns]
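
For reference, a plain groupby should give the same table; this is a sketch that is equivalent on this data:

>>> # count rows per (Location, Class) pair, then spread Class into columns
>>> class_info = df.groupby(["Location", "Class"]).size().unstack(fill_value=0)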

and combine:

>>> new_df = pd.concat([summ, class_info], axis=1)
>>> new_df
          HH  L4  Pop  H  L  M
Location                      
DC         1   1    1  0  0  1
NY         2   2    3  2  1  0
SF         2   1    3  1  2  0
TX         3   2    3  1  2  0

[4 rows x 6 columns]

You can reorder this as you like.
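
For example, to match the column order shown in the question (assuming you keep these column names):

>>> new_df = new_df[["Pop", "H", "L", "M", "HH", "L4"]]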

Licensed under: CC-BY-SA with attribution