Question

I am about to start a new project which involves taking an Excel file, parsing the data (with php-excel-reader), and then using the parsed values in an HTML email.

My question is pretty simple. Is it better practice to store the parsed data in a database first and then use the data however I wish?

For me it makes more sense, as then I don't need to re-parse if errors occur when sending the email, for example.

Solution

I think parsing the file and storing the data to a database would be a good idea.

It provides a transactional history so you can retry failed messages, audit records sent, and provide reporting.

That said, if you have no requirements to support any of those functions and no possibility of having them in the future, writing to a database would just be unnecessary overhead.
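
As a rough illustration of what that could look like, here is a minimal sketch, assuming a hypothetical outbox table and PDO connection details that are not part of the question:

    <?php
    // Sketch only: an "outbox" table recording each parsed row together with
    // a send status, so failed messages can be retried and sent ones audited.
    $db = new PDO('mysql:host=localhost;dbname=mailer', 'user', 'pass');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $db->exec("CREATE TABLE IF NOT EXISTS outbox (
        id       INT AUTO_INCREMENT PRIMARY KEY,
        email    VARCHAR(255) NOT NULL,
        payload  TEXT NOT NULL,  -- the parsed row, JSON-encoded
        status   ENUM('pending','sent','failed') DEFAULT 'pending',
        attempts INT DEFAULT 0
    )");

    // $parsedRows stands in for whatever php-excel-reader produced.
    $parsedRows = [['email' => 'a@example.com', 'name' => 'Alice']];

    $insert = $db->prepare('INSERT INTO outbox (email, payload) VALUES (?, ?)');
    foreach ($parsedRows as $row) {
        $insert->execute([$row['email'], json_encode($row)]);
    }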

OTHER TIPS

For me it makes more sense, as then I don't need to re-parse if errors occur when sending the email, for example.

In such a case, the main decision criteria are simplicity and performance (which depend not only on the process you are implementing, but also on how you do it).

For example, when the running time for re-parsing the input file is negligible, and you need the full data of the Excel sheet again in case of an email sending error, it will probably be simpler and faster to re-parse the Excel file rather than take on the burden of storing the data in a database first and retrieving it again when the email has to be resent. Re-parsing the same data twice is not "bad" just because it happens twice, as long as it reliably gives you the same output from the same input, and as long as the parsing does not involve a complex, very slow transformation process.
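
To illustrate, here is a minimal sketch of this re-parse-on-demand approach, assuming the val()/rowcount() accessors of php-excel-reader's Spreadsheet_Excel_Reader class; the file names, column layout, and mail details are made up:

    <?php
    require_once 'excel_reader2.php';   // php-excel-reader; file name may differ

    // Re-parse the workbook each time the data is needed; for a small file
    // this is cheap enough that no intermediate store is required.
    function parseRecipients(string $file): array
    {
        $xls  = new Spreadsheet_Excel_Reader($file);
        $rows = [];
        for ($r = 2; $r <= $xls->rowcount(); $r++) {   // row 1 = header
            $rows[] = ['email' => $xls->val($r, 1), 'name' => $xls->val($r, 2)];
        }
        return $rows;
    }

    foreach (parseRecipients('recipients.xls') as $rec) {
        if (!mail($rec['email'], 'Hello', "Hi {$rec['name']}")) {
            // On failure, just call parseRecipients() again later and retry;
            // no database round trip is involved.
        }
    }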

If the parsing itself can surface errors which must be corrected first (maybe the spreadsheet does not have the expected structure?), or if there is a clean-up step involved, the situation starts to change. Then you need an additional intermediate data store for the cleaned data anyway. That could be a new Excel file, of course, and that may still be the simplest solution. But if you need to integrate additional data from other sources as well, or to enforce some kind of relational constraints on the data, some kind of lightweight database might serve you better.
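
One lightweight option in PHP would be a SQLite file via PDO, which provides an intermediate store with relational constraints but no database server; the table layout below is purely illustrative:

    <?php
    // SQLite file as a lightweight intermediate store for cleaned-up data.
    $db = new PDO('sqlite:cleaned.db');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS recipients (
        email TEXT NOT NULL UNIQUE,   -- relational constraint the store enforces
        name  TEXT NOT NULL
    )');
    $stmt = $db->prepare('INSERT OR IGNORE INTO recipients (email, name) VALUES (?, ?)');
    $stmt->execute(['a@example.com', 'Alice']);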

However, let's assume you will have to generate 1000 emails from one Excel file, each based on a different part of the data in the file. Now, in the process of sending the mails, 5 of them bounce and you need to retrieve the data for exactly those 5 receivers to prepare a resend. In such a case, chances are high that by using a database to re-query exactly the data needed for those 5 persons, you can make the process simpler and faster. And if you need to store additional metadata, like how many sending attempts you made for each receiver, a database gives you a place where you can introduce additional tables or columns for it.
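
With an outbox-style table like the one sketched in the Solution above, re-querying just the failed receivers and tracking attempts takes only a couple of statements. Again a sketch; buildBody() and the connection details are assumptions:

    <?php
    // Assumed: the "outbox" table and connection from the Solution's sketch.
    $db = new PDO('mysql:host=localhost;dbname=mailer', 'user', 'pass');

    // Hypothetical helper rendering the HTML body from a stored payload.
    function buildBody(array $row): string
    {
        return '<p>Hello ' . htmlspecialchars($row['name']) . '</p>';
    }

    // Re-query only the failed receivers and track attempts as metadata.
    $failed = $db->query("SELECT id, email, payload FROM outbox WHERE status = 'failed'")
                 ->fetchAll(PDO::FETCH_ASSOC);

    $update = $db->prepare('UPDATE outbox SET attempts = attempts + 1, status = ? WHERE id = ?');
    foreach ($failed as $msg) {
        $ok = mail($msg['email'], 'Hello', buildBody(json_decode($msg['payload'], true)));
        $update->execute([$ok ? 'sent' : 'failed', $msg['id']]);
    }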

So the answer is: it depends. A database introduces additional overhead, but also gives you benefits; it is a trade-off. And if you currently do not know the upcoming requirements well enough, start with the simpler approach first (which is probably not using a database initially), but make sure your HTML generation works from some intermediate data structure. That gives you the option to switch to a database later, when you get requirements that demand it.
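
Concretely, the decoupling might look like this: the HTML generator only ever sees a plain array, never the parser or a database, so the data source can be swapped without touching the rendering code. This reuses the hypothetical parseRecipients() helper from the sketch above; loadRecipientsFromDb() is equally made up:

    <?php
    // The generator works from an intermediate data structure (a plain array).
    function renderEmail(array $recipient): string
    {
        return '<p>Dear ' . htmlspecialchars($recipient['name']) . ',</p>';
    }

    $recipients = parseRecipients('recipients.xls');   // today: re-parse the file
    // $recipients = loadRecipientsFromDb($db);        // later: read from a database
    foreach ($recipients as $rec) {
        $html = renderEmail($rec);
        // hand $html to your mailer here
    }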

(Since you code in PHP, I am guessing that the Excel file is uploaded via a browser, so it is coming from the Internet; if that is not the case, ignore my answer.)

Is it better practice to store the parsed data in a database first and then use the data however I wish?

I believe so. The data is coming from an untrusted source, the "bad Internet", so parsing it carefully means validating the data.

(a malicious hacker might "fake" some HTTP requests and build bad ones)

In your database, you want to store only trusted (non-malicious) data.

Things could be different in an intranet (internal to a corporation) web application: then you could somehow trust your users, and data validation might be slightly less important.

Always beware of code injection.
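
In practice, that means validating every parsed cell before it reaches the database or the HTML, using prepared statements rather than string concatenation, and escaping on output. A sketch under those assumptions, reusing the illustrative recipients table from the SQLite example above; the sample values are invented:

    <?php
    // Sample values standing in for parsed, untrusted spreadsheet cells.
    $cellValue = 'a@example.com';
    $name      = 'Alice <script>';

    // Validate before the value reaches the database or the HTML.
    $email = filter_var($cellValue, FILTER_VALIDATE_EMAIL);
    if ($email === false) {
        throw new InvalidArgumentException('Bad email address in spreadsheet');
    }

    // Prepared statements keep crafted cell contents out of the SQL.
    $db   = new PDO('sqlite:cleaned.db');
    $stmt = $db->prepare('INSERT INTO recipients (email, name) VALUES (?, ?)');
    $stmt->execute([$email, $name]);

    // Escape on output so cell contents cannot inject HTML/JS into the mail.
    $safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');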

It depends on the business model. Let us put it like this.

If processing the Excel file should generate a different result from the previous one, I would say store the data, and have the model process the output differently from the previous run according to the request URI.

But if the same routine (request) keeps running on the same data, process it with the service result set and then store the result in the DB.

There is a very old joke, circa 1916. A harassed young lieutenant sent the message "Send reinforcements, we are going to advance" via a runner, who passed the message on by telephone, and eventually a telegram arrived at HQ. A confused general received the message "Send three and fourpence, we are going to a dance" and duly sent the correct change.

Excel is a pretty decent data store (with a c***p API); just parse it when you need to. Using an intermediate data store will only introduce bugs, and it's doubtful that the extra I/O involved in writing to a database would lead to a performance improvement.

If your idea is that by storing the data in a database you don't need to parse it again when an error occurs: errors should be rare, and they are a big hit anyway, so a little extra time for parsing wouldn't matter. On the other hand, you must now make sure that the stored data is still there, that it hasn't been modified or overwritten, you must delete it when it's no longer needed, and so on; you add all kinds of extra work that you have to code, test, and get right. Especially since all of this has to work correctly when an error occurs, which is hard to test and difficult to get right, precisely because there has just been an error.

You add a great deal of additional work for a rare case where nobody cares about performance, and it isn't even likely that you actually gain any performance, because parsing Excel files isn't that slow and databases aren't that fast.

There was an article about the evils of "premature optimisation"; that whole argument assumes there is actually an optimisation here, which I rather doubt. Has anyone actually complained that re-parsing the Excel file after an error is too slow?

Licensed under: CC-BY-SA with attribution