Conditional parsing and output of xlsx files with Openpyxl

https://stackoverflow.com/questions/16851167

30-05-2022

Question

I'm working through data for a research project. The output is in the form of .csv files, which have been converted to .xlsx files. There is a separate output file for each participant, and each file contains data on about 40 different measurements across several dozen stimuli. To make sense of the data, we need to look at each stimulus separately, together with its associated measurements. Each output file is large (50 columns by 60000 rows). I’m looking to parse each file using openpyxl, searching a pre-specified column for cells with a particular string value. When such a cell is found, I want to write it to a new workbook along with other specified columns from the same row.

For instance, parsing the following table, I’m trying to use openpyxl to search column A for ‘Slide 2’. When this value is found for a particular row, that cell is written to a new workbook along with the values in column C and D for that same row.

    A          B       C       D
1   Slide      Data1   Data2   Data3
2   Slide 1    1       2       3
3   Slide 2    4       5       6
4   Slide 2    7       8       9

Would write:

    A          B       C       D
2   Slide 2    5       6
3
4

... or some similar format.

I would also look to fill column D and E with data from the next file, and F and G with data from the file after that (and so on), but I can probably figure that part out.

I’ve tried:

from openpyxl import load_workbook

wb = load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]
dest_filename = r'output.xlsx'

for x in range (0, 100): #0-100 as proof of concept before parsing entire worksheet
    if ws.cell(row = x, column =26) == 'some_image.jpg':
        print (ws.cell(row =x, column =26), ws.cell(row = x, column = 10), ws.cell(row = x, column = 17))

wb.save = dest_filename

I also tried adding the following, in an attempt to create a worksheet in memory within which to manipulate cells:

for i in range (0, 30):
    for j in range (0, 100):
        print (ws.cell(row =i, column=j))

... both with minor variations, but they all output a copy of the original file.

I’ve read and re-read the documentation for openpyxl but to no avail. There doesn’t seem to be any similar question on the forums here either.

Any insight in correctly manipulating and writing data would be greatly appreciated. I also hope this might help other people trying to make sense of huge datasets. Thanks in advance!

I'm on Windows 7 running Python3.3.2 (64 bit) with openpyxl-1.6.2. Data was originally in .csv format, so could be exported to .xls or other formats if this helps. I looked into xlutils (using xlwt and xlrd) briefly, but openpyxl worked better with xlsx files.

Edit

Many thanks to @MikeMüller for pointing out I needed two workbooks to transfer data between. That makes much more sense.

I now have the following, but it still returns an empty workbook. The original cells are not blank. (The commented lines are for simplification, without the indent of course, but the code is not successful either way.)

import openpyxl

wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]

wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]

#n = 1

#for x in range (0, 1000):
    #if ws.cell(row = x, column = 27) == '7.image2.jpg':
        ws_out.cell(row = n, column = 1) == ws.cell(row = x, column = 26) #x changed
        ws_out.cell(row = n, column = 2) == ws.cell(row = x, column = 10) #x changed
        ws_out.cell(row = n, column = 3) == ws.cell(row = x, column = 17) #x changed
        #n += 1

wb_out.save('output108.xlsx')

Edit 2

I've updated the code to include the .value for cells, but it still returns a blank workbook.

import openpyxl

wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]

wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]

n = 1

for x in range (0, 1000):
    if ws.cell(row=x, column=27).value == '7.Image001.jpg':
        ws_out.cell(row=n, column=1).value = ws.cell(row=x, column=27).value
        ws_out.cell(row=n, column=2).value = ws.cell(row=x, column=10).value
        ws_out.cell(row=n, column=3).value = ws.cell(row=x, column=17).value
        n += 1

wb_out.save('output108.xlsx')

Summary for the next person with trouble:

You need two workbooks in memory: one to import your file, and the other to write to a new workbook file.

Use each cell's .value attribute to read the contents of a cell in the imported workbook, and assign it to the .value of the desired cell in the exported workbook.

Make sure you start counting rows and columns at zero. (This applies to the openpyxl 1.x releases used here; from openpyxl 2.0 onward, rows and columns passed to ws.cell() are counted from 1.)
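
The three points above can be sketched roughly as follows for a current openpyxl release, where row and column numbers start at 1 (not 0 as in openpyxl 1.6.2). The names copy_matching_rows and demo, and all of their parameters, are made up for illustration and are not part of openpyxl; the helper only relies on ws.max_row and ws.cell(row=..., column=...).value.

```python
def copy_matching_rows(ws, ws_out, match_col, target, keep_cols, first_row=1):
    """Copy selected columns of every row whose match_col cell equals target.

    ws and ws_out are assumed to be openpyxl worksheets that belong to two
    different workbooks. Row and column numbers are 1-based (openpyxl 2.0+).
    Returns the number of rows written to ws_out.
    """
    out_row = 1
    for row in range(first_row, ws.max_row + 1):
        # Compare the cell's .value, not the cell object itself.
        if ws.cell(row=row, column=match_col).value == target:
            for out_col, src_col in enumerate(keep_cols, start=1):
                # Assign to .value; "==" would only compare and discard the result.
                ws_out.cell(row=out_row, column=out_col).value = (
                    ws.cell(row=row, column=src_col).value
                )
            out_row += 1
    return out_row - 1


def demo():
    """Rebuild the sample table from the question in memory and filter it."""
    import openpyxl  # third-party; imported here so the helper has no hard dependency

    wb = openpyxl.Workbook()
    ws = wb.active
    for values in [
        ("Slide", "Data1", "Data2", "Data3"),
        ("Slide 1", 1, 2, 3),
        ("Slide 2", 4, 5, 6),
        ("Slide 2", 7, 8, 9),
    ]:
        ws.append(values)

    wb_out = openpyxl.Workbook()  # the second workbook, used only for output
    copy_matching_rows(ws, wb_out.active, match_col=1, target="Slide 2",
                       keep_cols=(1, 3, 4))
    wb_out.save("output.xlsx")  # save is a method: call it, do not assign to it
```

Calling demo() should write the two 'Slide 2' rows, reduced to columns A, C and D, into output.xlsx. With openpyxl 1.6.2 the same loop would have to count from 0 instead.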

Solution

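You are doing cell assignment incorrectly: == only compares two cells and discards the result, so nothing is ever written. Assign to each cell's .value instead. Here's what should work: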

import openpyxl

wb = openpyxl.load_workbook(filename = r'test108.xlsx')
ws = wb.worksheets[0]

wb_out = openpyxl.Workbook()
ws_out = wb_out.worksheets[0]

n = 1

for x in range (0, 1000):
    if ws.cell(row=x, column=27).value == '7.image2.jpg':
        ws_out.cell(row=n, column=1).value = ws.cell(row=x, column=26).value #x changed
        ws_out.cell(row=n, column=2).value = ws.cell(row=x, column=10).value #x changed
        ws_out.cell(row=n, column=3).value = ws.cell(row=x, column=17).value #x changed
        n += 1

wb_out.save('output108.xlsx')
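
For worksheets of the size described in the question (about 50 columns by 60000 rows), addressing cells one at a time can be slow. A possible alternative, assuming openpyxl 2.6 or later, is to stream rows through read-only and write-only workbooks. This is a sketch, not the method used in the answer above: select_rows and filter_workbook are hypothetical helper names, and column positions are 0-based here because the rows arrive as plain Python tuples.

```python
def select_rows(rows, match_idx, target, keep_idxs):
    """Yield the keep_idxs values of each row whose match_idx value equals target.

    rows is any iterable of tuples; indexes are 0-based positions in the tuple.
    Rows too short to contain match_idx are skipped, and missing kept
    positions are filled with None.
    """
    for row in rows:
        if len(row) > match_idx and row[match_idx] == target:
            yield tuple(row[i] if i < len(row) else None for i in keep_idxs)


def filter_workbook(src_path, dst_path, match_idx, target, keep_idxs):
    """Stream matching rows from the first sheet of src_path into dst_path."""
    import openpyxl  # third-party; imported lazily so select_rows works without it

    wb = openpyxl.load_workbook(src_path, read_only=True)
    ws = wb.worksheets[0]

    # A write-only workbook starts with no sheets, so one must be created.
    wb_out = openpyxl.Workbook(write_only=True)
    ws_out = wb_out.create_sheet()

    # values_only=True yields plain tuples of cell values instead of cell objects.
    rows = ws.iter_rows(values_only=True)
    for values in select_rows(rows, match_idx, target, keep_idxs):
        ws_out.append(values)

    wb_out.save(dst_path)
    wb.close()  # read-only workbooks keep the source file open until closed
```

For the sample table in the question, filter_workbook('test108.xlsx', 'output108.xlsx', 0, 'Slide 2', (0, 2, 3)) would match on the first column and keep the first, third and fourth.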

OTHER TIPS

You need to open a second workbook for writing:

import openpyxl
wb_out = openpyxl.Workbook()  # the file name is only needed when saving
ws_out = wb_out.worksheets[0]

Put this in your loop:

ws_out.cell('cell indices here').value = desired_value

Save your file:

wb_out.save(dest_filename)
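
The question also mentions placing the results from each further participant file in the next pair of columns (D and E, then F and G, and so on). One way to lay that out, sketched here with the standard library only, is to build the combined rows first and then write each one to the output sheet. The function name side_by_side, and the assumption that every file yields (label, value, value) tuples in the same order, are illustrative and not from the original post.

```python
from itertools import zip_longest


def side_by_side(per_file_rows, width=2):
    """Combine per-file results into rows: one label, then `width` values per file.

    per_file_rows has one entry per file; each entry is a list of tuples
    (label, value_1, ..., value_width). Files with fewer matches are padded
    with None so that later files stay in their own columns.
    """
    combined = []
    for group in zip_longest(*per_file_rows):
        # Take the label from the first file that has a row at this position.
        label = next((row[0] for row in group if row is not None), None)
        out = [label]
        for row in group:
            values = row[1:1 + width] if row is not None else ()
            out.extend(values)
            # Pad short or missing rows to keep the column blocks aligned.
            out.extend([None] * (width - len(values)))
        combined.append(out)
    return combined
```

Each combined row can then be passed to ws_out.append(row); openpyxl leaves a cell empty where the value is None.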

Licensed under: CC-BY-SA with attribution
