Question

There is a column in my csv called cost, which I want to sum based on another column, called factory, to basically create a breakdown of cost by factory. I have rows such as the following, where there are multiple costs for each factory:

Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1

I'd like to be able to quickly sum the cost per factory, and save these values to a variable. I think one way to do this is by looping through a list of factories, which then loops through the csv. Here is where I've got to:

factories= ['Bali', 'Sydney', 'Denver']
totalcost = 0
balicost = 0
sydneycost = 0
denvercost = 0

for factory in factories:
    for row in csv.reader(costcsv):
        if row[0] == factory:

Where I'm stuck is that I don't know how to change the variable which is being added to for the different factories, balicost, sydneycost and denvercost. The simplified version, where I'm just getting the total of the cost column was as follows:

for row in csv.reader(costcsv):
        totalcost += float(row[1])

I'm more than open to different approaches than this (I believe dictionaries could come into it), and appreciate any points in the right direction.

Solution

[Community wiki, because it's a little tangential.]

When you're processing tabular data in Python, you should consider the pandas library. The operation you want to perform is a groupby sum, and that's easily done in two lines:

import pandas as pd

df = pd.read_csv("factories.csv")
by_factory = df.groupby("Factory")["Cost"].sum()

which produces a Series object you can index into like a dictionary:

>>> by_factory
Factory
Bali       32
Denver      8
Sydney     25
Name: Cost, dtype: int64
>>> by_factory["Bali"]
32

Update, using the updated data: if you also want to handle Cost_Type, you have several options. One is to select only the rows with Cost_Type == 1:

>>> df[df.Cost_Type == 1]
  Factory  Cost  Cost_Type
1  Sydney    21          1
3  Denver     8          1
4    Bali     9          1

[3 rows x 3 columns]
>>> df[df.Cost_Type == 1].groupby("Factory")["Cost"].sum()
Factory
Bali        9
Denver      8
Sydney     21
Name: Cost, dtype: int64

or you can expand the groupby and group on both Factory and Cost_Type simultaneously:

>>> df.groupby(["Cost_Type", "Factory"])["Cost"].sum()
Cost_Type  Factory
0          Bali       23
1          Bali        9
           Denver      8
           Sydney     21
2          Sydney      4
Name: Cost, dtype: int64
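
Since the question asks for the totals to end up in a variable, here is a minimal self-contained sketch of the same groupby, assuming pandas is installed. io.StringIO stands in for the real file, and the names CSV_TEXT and bali_type_1 are made up for illustration.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the real file.
CSV_TEXT = """Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Total cost per factory, converted from a Series to a plain dict.
by_factory = df.groupby("Factory")["Cost"].sum().to_dict()

# Totals keyed by (Cost_Type, Factory); .loc accepts a tuple on a MultiIndex.
by_type = df.groupby(["Cost_Type", "Factory"])["Cost"].sum()
bali_type_1 = by_type.loc[(1, "Bali")]

print(by_factory)
print(bali_type_1)
```

Converting with to_dict() is optional; the Series can already be indexed by factory name, so the dict is only useful if the rest of your code expects one.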

Other tips

The easiest way is to use a dictionary to hold the running total for each factory:

factoriescost = {}
for row in csv.reader(costcsv):
    factory = row[0]
    if factory not in ('Bali', 'Sydney', 'Denver'):
        continue
    factorycost = factoriescost.get(factory, 0)
    factoriescost[factory] = factorycost + float(row[1])
totalcost = sum(factoriescost.values())

Then you can use factoriescost to get the total for a given factory:

>>> print totalcost, factoriescost
65.0 {'Denver': 8.0, 'Sydney': 25.0, 'Bali': 32.0}
>>> print factoriescost['Bali']
32.0
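
For Python 3, the same dict.get() approach might look like the sketch below. It reads the sample rows from an in-memory string instead of costcsv and uses csv.DictReader so the header row supplies the column names; CSV_TEXT and WANTED are names invented for this example.

```python
import csv
import io

# Sample rows copied from the question; io.StringIO stands in for the real file.
CSV_TEXT = """Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1
"""

# Hypothetical filter: only these factories are counted.
WANTED = {"Bali", "Sydney", "Denver"}

factoriescost = {}
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    factory = row["Factory"]
    if factory not in WANTED:
        continue
    # get() returns 0.0 the first time a factory is seen.
    factoriescost[factory] = factoriescost.get(factory, 0.0) + float(row["Cost"])

totalcost = sum(factoriescost.values())
print(totalcost, factoriescost)
```
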

You can use a dictionary as shown below. The code uses a try/except block to sum the cost per factory: if the factory is not yet in the dictionary, a KeyError is raised and the factory is added with its first cost.

a = [['Bali', 23],
     ['Sydney', 21],
     ['Sydney', 4],
     ['Denver', 8],
     ['Bali', 9]]

factories = dict()

for factory, cost in a:
    try:
        factories[factory] += cost
    except KeyError:
        factories[factory] = cost

print(factories)
# {'Denver': 8, 'Sydney': 25, 'Bali': 32}

In your example case you would replace the for loop with an appropriate one for csv.reader() along the lines of:

for factory, cost in csv.reader(costcsv):
    try:
        ...
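
One possible way to fill in that loop, as a sketch only: it assumes the three-column layout from the question, skips the header row, and converts each cost to an int because csv.reader yields strings. CSV_TEXT is a stand-in for the real file.

```python
import csv
import io

# Sample rows copied from the question; io.StringIO stands in for the real file.
CSV_TEXT = """Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1
"""

reader = csv.reader(io.StringIO(CSV_TEXT))
next(reader, None)  # skip the header row

factories = {}
for factory, cost, _cost_type in reader:
    cost = int(cost)  # csv.reader yields strings, so convert before adding
    try:
        factories[factory] += cost
    except KeyError:
        # First time this factory is seen: start its total.
        factories[factory] = cost

print(factories)
```
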

Your csv should be:

Factory,Cost
Bali,23
Sydney,21
Sydney,4
Denver,8
Bali,9

And in Python you can do:

import csv

factories= ['Bali', 'Sydney', 'Denver']
totalcost = 0

sums = {}

with open('file.csv', 'rb') as f:
    f.next()                        # Jump to second row -> first : header
    reader = csv.reader(f)
    for row in reader:
        if row[0] not in sums:
            sums[row[0]] = int(row[1])
        else:
            sums[row[0]] += int(row[1])


for key,value in sums.items():
    totalcost = totalcost  + int(value)

The result looks like:

print sums
>{'Denver': 8, 'Sydney': 25, 'Bali': 32}
print totalcost
>65
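
The code above targets Python 2 (f.next(), 'rb' mode, print statements). A rough Python 3 equivalent is sketched below; it writes the two-column sample to a temporary file first so the example can run on its own, which is an addition for illustration and not part of the approach itself.

```python
import csv
import os
import tempfile

# Write the two-column sample to a temporary file so the sketch is runnable.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    tmp.write("Factory,Cost\nBali,23\nSydney,21\nSydney,4\nDenver,8\nBali,9\n")
    path = tmp.name

sums = {}
# In Python 3 the csv module wants a text-mode file opened with newline="".
with open(path, newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for factory, cost in reader:
        if factory not in sums:
            sums[factory] = int(cost)
        else:
            sums[factory] += int(cost)

os.remove(path)

totalcost = sum(sums.values())
print(sums)
print(totalcost)
```
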

Rather than having separate variables, consider a dictionary or, easier, collections.defaultdict:

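import csv
from collections import defaultdict

costs = defaultdict(float)

# Assumes costcsv is an open file of two-column rows with no header line.
for line in csv.reader(costcsv):
    if len(line) == 2:
        # Unpack into cost (not costs) so the dictionary is not overwritten.
        factory, cost = line
        costs[factory] += float(cost)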

This gives you a dictionary in which you can look up any factory (not just the three you currently hard-code) and get its total cost:

costs["Denver"] == 8.0
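
If the file has the three-column layout with a header row, as in the updated question, the defaultdict idea could be extended along these lines. This is a sketch: CSV_TEXT and costs_by_type are names made up for the example, and the Cost_Type keys stay as strings because the csv module does not convert them.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question; io.StringIO stands in for the real file.
CSV_TEXT = """Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1
"""

costs = defaultdict(float)  # factory -> total cost
costs_by_type = defaultdict(lambda: defaultdict(float))  # factory -> cost type -> total

for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    cost = float(row["Cost"])
    costs[row["Factory"]] += cost
    costs_by_type[row["Factory"]][row["Cost_Type"]] += cost

print(costs["Denver"])
print(costs_by_type["Sydney"]["2"])
```

One thing to keep in mind with defaultdict: looking up a factory that never appeared (for example a misspelled name) silently creates an entry with 0.0 rather than raising a KeyError.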
License: CC-BY-SA with attribution
Not affiliated with StackOverflow