Question

I am writing a Python script that parses an Excel file using the xlrd library. I would like to run calculations on some columns only when the cells contain a certain value, skip the rows that do not qualify, and store the output in a dictionary. Here's what I tried:

import xlrd


workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')

num_rows = worksheet.nrows -1
num_cells = worksheet.ncols - 1

first_col = 0
scnd_col = 1
third_col = 2

# Read Data into double level dictionary
celldict = dict()
for curr_row in range(num_rows)  :

    cell0_val = int(worksheet.cell_value(curr_row+1,first_col))
    cell1_val = worksheet.cell_value(curr_row,scnd_col)
    cell2_val = worksheet.cell_value(curr_row,third_col)

    if cell1_val[:3] == 'BL1' :
        if cell2_val=='toSkip' :
        continue
    elif cell1_val[:3] == 'OUT' :
        if cell2_val == 'toSkip' :
        continue
    if not cell0_val in celldict :
        celldict[cell0_val] = dict()
# if the entry isn't in the second level dictionary then add it, with count 1
    if not cell1_val in celldict[cell0_val] :
        celldict[cell0_val][cell1_val] = 1
        # Otherwise increase the count
    else :
        celldict[cell0_val][cell1_val] += 1

As you can see, I count the number of "cell1_val" values for each "cell0_val". But I would like to skip the rows that have "toSkip" in the adjacent column's cell before doing the sum and storing it in the dict. I am doing something wrong here, and I feel like the solution is much simpler. Any help would be appreciated. Thanks.

Here's an example of my sheet:

cell0 cell1  cell2
12    BL1    toSkip
12    BL1    doNotSkip
12    OUT3   doNotSkip
12    OUT3   toSkip
13    BL1    doNotSkip
13    BL1    toSkip
13    OUT3   doNotSkip

Solution

Use collections.defaultdict with collections.Counter for your nested dictionary.

Here it is in action:

>>> import pprint
>>> from collections import defaultdict, Counter
>>> d = defaultdict(Counter)
>>> d['red']['blue'] += 1
>>> d['green']['brown'] += 1
>>> d['red']['blue'] += 1
>>> pprint.pprint(dict(d))
{'green': Counter({'brown': 1}), 'red': Counter({'blue': 2})}

Here it is integrated into your code:

from collections import defaultdict, Counter
import xlrd

workbook = xlrd.open_workbook('filter_data.xlsx')
worksheet = workbook.sheet_by_name('filter_data')

first_col = 0
scnd_col = 1
third_col = 2

celldict = defaultdict(Counter)
for curr_row in range(1, worksheet.nrows):  # starting at 1 skips the header row

    cell0_val = int(worksheet.cell_value(curr_row, first_col))
    cell1_val = worksheet.cell_value(curr_row, scnd_col)
    cell2_val = worksheet.cell_value(curr_row, third_col)

    if cell2_val == 'toSkip' and cell1_val[:3] in ('BL1', 'OUT'):
        continue

    celldict[cell0_val][cell1_val] += 1

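I also combined your if-statements into a single condition and simplified the row indexing: the loop now starts at row 1, so every cell in a row is read with the same curr_row instead of mixing curr_row and curr_row+1.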
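To check the counting logic without an Excel file, here is a minimal sketch that runs the same loop over the rows from the example sheet in the question. The names SAMPLE_ROWS and count_cells are made up for this illustration; they are not part of xlrd or of the original code.

```python
from collections import Counter, defaultdict

# Rows copied from the example sheet in the question: (cell0, cell1, cell2).
# SAMPLE_ROWS and count_cells are hypothetical names used only for this sketch.
SAMPLE_ROWS = [
    (12, 'BL1', 'toSkip'),
    (12, 'BL1', 'doNotSkip'),
    (12, 'OUT3', 'doNotSkip'),
    (12, 'OUT3', 'toSkip'),
    (13, 'BL1', 'doNotSkip'),
    (13, 'BL1', 'toSkip'),
    (13, 'OUT3', 'doNotSkip'),
]


def count_cells(rows):
    """Count cell1 values per cell0 value, ignoring rows marked 'toSkip'."""
    celldict = defaultdict(Counter)
    for cell0_val, cell1_val, cell2_val in rows:
        # Same skip condition as in the answer's loop.
        if cell2_val == 'toSkip' and cell1_val[:3] in ('BL1', 'OUT'):
            continue
        celldict[cell0_val][cell1_val] += 1
    return celldict


result = count_cells(SAMPLE_ROWS)
for key in sorted(result):
    print(key, dict(result[key]))
```

With the sample data, each of the keys 12 and 13 should end up with one BL1 and one OUT3, because every 'toSkip' row is dropped before it is counted.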

OTHER TIPS

It appears you want to skip the current row whenever cell2_val equals 'toSkip', regardless of what cell1_val contains. If so, you can simplify the code by adding if cell2_val == 'toSkip': continue directly after computing cell2_val.

Also, where you have

# if the entry isn't in the second level dictionary then add it, with count 1
if not cell1_val in celldict[cell0_val] :
    celldict[cell0_val][cell1_val] = 1
    # Otherwise increase the count
else :
    celldict[cell0_val][cell1_val] += 1

the usual idiom is more like

celldict[cell0_val][cell1_val] = celldict[cell0_val].get(cell1_val, 0) + 1

That is, use a default value of 0 so that if key cell1_val is not yet in celldict[cell0_val], then get() will return 0.
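Putting both tips together with plain dictionaries might look like the sketch below. It assumes the rows are already available as (cell0, cell1, cell2) tuples; the function name count_cells_plain and the sample rows are invented for illustration. The outer level uses dict.setdefault, which is one way to create the inner dictionary on first use.

```python
def count_cells_plain(rows):
    """Count cell1 values per cell0 value using only built-in dicts."""
    celldict = {}
    for cell0_val, cell1_val, cell2_val in rows:
        # Tip 1: skip the row as soon as the flag is known.
        if cell2_val == 'toSkip':
            continue
        # Create the inner dict on first use, then reuse it afterwards.
        inner = celldict.setdefault(cell0_val, {})
        # Tip 2: get() returns 0 when cell1_val has not been seen yet.
        inner[cell1_val] = inner.get(cell1_val, 0) + 1
    return celldict


# Hypothetical rows, shaped like the sheet in the question.
rows = [
    (12, 'BL1', 'toSkip'),
    (12, 'BL1', 'doNotSkip'),
    (12, 'BL1', 'doNotSkip'),
    (13, 'OUT3', 'doNotSkip'),
]
counts = count_cells_plain(rows)
print(counts)
```

Note that this version skips every 'toSkip' row, while the accepted solution only skips rows whose cell1 value starts with BL1 or OUT. On the sheet shown in the question the two behave the same.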

Licensed under: CC-BY-SA with attribution