Reading extracted XLSX files with OpenPyXL

https://stackoverflow.com/questions/14131556

12-01-2022
|

Question

So I've been using Python 3.2, and OpenPyXL's iterable workbook as demonstrated here in the "Optimized Reader" example.

My problem arises when I try to use this strategy to read a file or files that I've extracted from a simple .zip archive (both manually and through the python zipfile package). When I call .get_highest_column() I get "A" and .get_highest_row() I get 1, and when asked to print each cell's value as shown here:

wb = load_workbook(filename = file_name, use_iterators = True)
ws = wb.worksheets[0]    # Only need to read the first sheet, nothing fancy
for row in ws.iter_rows():
    for entry in row:
        print(entry.internal_value)

It prints the values in A1, A2, A3, A4, A5, A6, and A7, regardless of how large the file actually is. There isn't any reason for this in the file itself, and it will open in Excel perfectly fine. I'm quite stumped as to why it does it like this, but I assume that the unzipped XLSX is formatted differently prior to being saved from within Excel, and OpenPyXL cannot interpret it correctly. I even renamed the '.xlsx' to '.zip' so that I could explore the file and examine the differences, but couldn't tell much except that the one saved from Excel also has a subfolder called "theme" within the "xl" folder that the previous version does not, with font and formatting data.

IMPORTANT NOTE: When I open it and re-save it with the same filename from within Excel and then run this bit of code, it works perfectly - returns correct greatest row and column values, and correctly prints every cell value. I've tried instead saving the workbook through OpenPyXL immediately after opening it, but this yields the same erroneous results.

Basically, I need to discover a method to properly extract a .xlsx file from a .zip file so that it can be read with OpenPyXL. There are many many files that need to be processed like this, so it must be external to Excel, and hopefully as efficient as possible.

Cheers!

Solution

It sounds like this has nothing to do with the extraction from the zipfile, as the problem also occurs if you manually extract the files. I would try to store the files opened and saved with Excel in a zipfile and see what happens. If that works, then clearly the way the original .xlsx files were generated is the problem. I strongly suspect that to be the case.

If that is the problem, see if you can extract the .xlsx files (they are zipfiles themselves) and compare the one you re-saved with Excel to the original problematic one. xml does not compare easily as Excel can rearrange most things at will, but you might be able to do a diff.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow