Generate large Excel files and response from API

https://softwareengineering.stackexchange.com/questions/406430

08-03-2021

Question

I need to generate a large Excel file (around 50 MB) and send it in a response to another API, which will offer it to the front end as a download.

Is it better to save the generated Excel file and return its path in the response to the front-end API (similar to what webmail apps do), or to return the file itself as a byte array instead of a path?

Solution

The answer to your question depends on the answer to this question:

Do you have retention requirements for the generated Excel file?

If you do not have any requirements to keep or maintain the Excel file and each download is unique, you owe it to yourself to keep it ephemeral. In short, it should be generated each time it is requested.

Byte arrays are problematic because the whole file has to sit in memory at once, and what is only 5 MB today can grow to 50 MB or 500 MB later. You've already seen the need for larger spreadsheets. That's why I advocate using streams.

Current Java libraries for generating Excel workbooks (such as Apache POI) let you write directly to an OutputStream. If that OutputStream is the servlet's output stream, you avoid doubling the memory requirements, because the data is never copied into a byte array first.
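To make the difference concrete, here is a minimal sketch using only the JDK. It is not POI code: RowStreamer, CountingSink and StreamingDemo are hypothetical names, and the generator writes placeholder text rows rather than a real workbook. The point is only that a generator which accepts an OutputStream (as POI's Workbook.write(OutputStream) does) can be aimed at a network stream instead of an in-memory buffer without changing the generator.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a workbook generator: all it needs is an OutputStream. */
class RowStreamer {
    // ASSUMPTION: placeholder text rows, not a valid .xlsx payload.
    static void writeRows(OutputStream out, int rowCount) throws IOException {
        for (int i = 0; i < rowCount; i++) {
            byte[] row = ("row-" + i + ",value-" + i + "\n").getBytes(StandardCharsets.UTF_8);
            out.write(row); // each row is handed to the stream as soon as it is produced
        }
        out.flush();
    }
}

/** Counts bytes and discards them, standing in for a network stream such as the servlet output stream. */
class CountingSink extends OutputStream {
    private long count;

    @Override
    public void write(int b) {
        count++;
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        count += len;
    }

    long count() {
        return count;
    }
}

class StreamingDemo {
    public static void main(String[] args) throws IOException {
        // Byte-array approach: the full payload is held in memory before anything is sent.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        RowStreamer.writeRows(buffer, 100_000);
        System.out.println("bytes held in memory: " + buffer.size());

        // Streaming approach: same generator, same bytes, nothing accumulates in the sink.
        CountingSink sink = new CountingSink();
        RowStreamer.writeRows(sink, 100_000);
        System.out.println("bytes streamed: " + sink.count());
    }
}
```

Both runs produce the same number of bytes; only the first one keeps them all on the heap. With POI itself, the library's own buffering still matters, so its streaming workbook variant (SXSSF) is worth a look for large sheets.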

Make sure your content headers are set appropriately:

Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Content-Disposition: attachment; filename="workbook.xlsx" (specifies the filename the browser will use when saving)
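As a rough illustration of those two headers on a streamed response, the sketch below uses the HTTP server bundled with the JDK (com.sun.net.httpserver) so that it runs without a servlet container. The class name DownloadServer, the "/export" path and the placeholder body are assumptions made for the example; in a servlet you would set the same header values on the response object and write the workbook to its output stream.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class DownloadServer {
    static final String XLSX_TYPE =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /** Starts a local server with one download endpoint; port 0 picks any free port. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/export", DownloadServer::handle); // "/export" is a made-up path
        server.start();
        return server;
    }

    static void handle(HttpExchange exchange) throws IOException {
        exchange.getResponseHeaders().set("Content-Type", XLSX_TYPE);
        exchange.getResponseHeaders().set(
                "Content-Disposition", "attachment; filename=\"workbook.xlsx\"");
        // A length of 0 tells the JDK server the size is unknown, so the body is sent chunked.
        exchange.sendResponseHeaders(200, 0);
        try (OutputStream body = exchange.getResponseBody()) {
            writePlaceholderRows(body, 1_000);
        }
    }

    // ASSUMPTION: placeholder bytes, not a valid .xlsx. A real workbook generator would write here.
    static void writePlaceholderRows(OutputStream out, int rowCount) throws IOException {
        for (int i = 0; i < rowCount; i++) {
            out.write(("row-" + i + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Calling DownloadServer.start(8080) and requesting http://127.0.0.1:8080/export should return both headers. Because the length is not declared up front, the body can be written as it is generated.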

When to use a location outside the microservice

There are legitimate use cases for storing a file outside the confines of the typical microservice, but you do have to think about retrieval. For example, generating a file and putting it on a shared drive may sort of "work", but you can't guarantee that the permissions on that drive are valid for your intended user. They might not even be able to reach the machine that holds the file. That's when you need to make sure the location of your generated file is always accessible to your user. You could use a service such as AWS S3, or your provider's equivalent, or you could use a caching server.

Reasons to use storage and return a location:

The process to generate the response is unacceptably long (i.e. the request times out consistently before the result is sent)
The Excel spreadsheet is generated asynchronously
The Excel spreadsheet is generated progressively (i.e. multiple services add a little bit more to the worksheet)
The spreadsheet must be retained for legal reasons
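For the asynchronous case, the general shape might look like the sketch below: the caller gets a job id straight away and asks for the file location later. Everything here is hypothetical (the AsyncExportService class, its method names, the local directory standing in for shared storage, the placeholder file contents); it is only meant to show how generation and delivery come apart.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical export service: submit returns immediately, location is polled later. */
class AsyncExportService implements AutoCloseable {
    private final Path storageDir; // stands in for S3 or other storage the user can reach
    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    private final Map<String, CompletableFuture<Path>> jobs = new ConcurrentHashMap<>();

    AsyncExportService(Path storageDir) {
        this.storageDir = storageDir;
    }

    /** Starts generation in the background and returns a job id right away. */
    String submit(int rowCount) {
        String jobId = UUID.randomUUID().toString();
        jobs.put(jobId, CompletableFuture.supplyAsync(() -> generate(jobId, rowCount), workers));
        return jobId;
    }

    /** Returns the file location once the job has finished successfully, otherwise empty. */
    Optional<Path> location(String jobId) {
        CompletableFuture<Path> job = jobs.get(jobId);
        if (job == null || !job.isDone() || job.isCompletedExceptionally()) {
            return Optional.empty();
        }
        return Optional.of(job.join());
    }

    private Path generate(String jobId, int rowCount) {
        Path target = storageDir.resolve(jobId + ".xlsx");
        try (OutputStream out = Files.newOutputStream(target)) {
            // ASSUMPTION: placeholder rows; a real workbook generator would write to "out" here.
            for (int i = 0; i < rowCount; i++) {
                out.write(("row-" + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    @Override
    public void close() {
        workers.shutdown(); // already submitted jobs are allowed to finish
    }
}
```

The front-end API would call submit, hand the job id back to the client, and have the client poll location (or be notified) until a path or URL is available. Note that the job map and the stored files both grow until something removes them.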

If you determine that you really do need to add the complexity of separating the generation from the delivery of the Excel spreadsheet, you need to start thinking about additional derived requirements:

What are the retention requirements? (i.e. how long before the data is stale)
What processes do you need to clean up unused documents? (i.e. prevent disk full events, or wasted resources that you are paying for)
Can the same document satisfy multiple users? If so, how do you ensure you point them all to the current version when requested?
How do you ensure the user can access the requested document?
If it is generated asynchronously, how do you intend to notify the user that the file is generated?
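The cleanup question in particular tends to be forgotten. As one possible sketch, assuming the generated files live in a single directory and that "stale" simply means "last modified longer ago than the retention period" (both assumptions, as is the ExportCleaner name), a periodic sweep could look like this:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

class ExportCleaner {
    /**
     * Deletes .xlsx files directly inside dir whose last-modified time is older
     * than the retention period, and returns how many were removed.
     */
    static int deleteExpired(Path dir, Duration retention, Instant now) throws IOException {
        Instant cutoff = now.minus(retention);
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.xlsx")) {
            for (Path file : files) {
                boolean expired = Files.isRegularFile(file)
                        && Files.getLastModifiedTime(file).toInstant().isBefore(cutoff);
                // deleteIfExists returns false if another process removed the file first
                if (expired && Files.deleteIfExists(file)) {
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

You would typically run something like ExportCleaner.deleteExpired(dir, Duration.ofHours(24), Instant.now()) on a schedule, for example from a ScheduledExecutorService. With object storage such as S3, a lifecycle rule on the bucket may do the same job without any code.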

If you write to a file system or equivalent, you inherit the risk that your service will fill up that resource without warning. That can impact more than just your system. I recommend avoiding this approach until you fully understand the impacts and how you intend to address the risks.
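One small mitigation, sketched here under the assumption that you write to a local or mounted file system, is to check the free space before starting a large export and refuse the request if it would eat into a safety margin. StorageGuard and its parameters are hypothetical names, and the value the JDK reports is a hint rather than a guarantee, so this reduces the risk without removing it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class StorageGuard {
    /**
     * Returns true if the volume holding dir reports enough usable space to take
     * requiredBytes while still leaving reserveBytes free for everything else.
     */
    static boolean hasRoomFor(Path dir, long requiredBytes, long reserveBytes) throws IOException {
        long usable = Files.getFileStore(dir).getUsableSpace();
        // Subtract on the left so a very large requiredBytes cannot overflow the sum.
        return usable - reserveBytes >= requiredBytes;
    }
}
```

A service could call this with an estimate of the export size and return an error (or queue the job) when it comes back false, rather than finding out mid-write that the disk is full.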

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange