Question

For more clarification, please look at my older post here: Database normalization - who's right?

I appreciate those nice answers, but I want to emphasize that the system we are making is not just for the purpose of learning. It is a real enrollment system for our school. We started about four months ago and the system was going all right, except for the confusion about denormalization.

Like I said, his reasons are:

  1. The queries may be deleted accidentally, thus making more problems.

  2. He said 2nd normal form is enough, as he did in all his systems from his past experience.

  3. People working with us (without enough technical knowledge) cannot build the queries they want from tables that are missing attributes/rows. (In my case, I decided to remove the total units because that can easily be computed from the other attributes.)

  4. Other systems such as accounting, payroll, inventory and purchasing are planned to integrate with the enrollment system. If that is the case, he said, it is good to connect each new system's database to our enrollment system database directly, without going through the queries.

  5. He argued that all dependent values, such as the computed average grade of each student, must also be included in tables, because what we need, he said, is the physical data, not something to be recomputed by means of views.

  6. More importantly, I guess, is that he wanted every transaction to be entered in the database, as in the case of debits and credits for the purpose of balancing transactions.

In my case, from what I've heard from him, he doesn't mention anything about speed except making queries from queries (which I believe is the main reason why we would need to denormalize). He simply wants everything documented in the database.

My position is contrary to all of these. If accuracy is our concern over speed, normalization is the way to go. By the way, we are using Microsoft SQL Server.

One last thing: I remember he wanted to include a full_name column in the students_info table. His reason? He said, "It is better to read from the table than to make another query. Just make sure the program can control the user input for the full_name."

Before I decide to discontinue making this system, please let me hear from you, more experienced people.


Solution

The queries may be deleted accidentally, thus making more problems.

This is what version control software is for. Also, if you can "accidentally" delete a view, you can probably accidentally delete a table.

He said 2nd normal form is enough, as he did in all his systems from his past experience.

Then he doesn't have enough experience. Particularly in accounting.

I'm famous (or notorious) for insisting that subordinates give me performant designs in 5NF. If they can't do that, they probably either a) don't know what 5NF is, or b) think every row should have an ID number in it. (Having an ID number in every row increases the number of joins needed, often contributes to poor performance, and has nothing to do with normalization.) Both of those are good opportunities for education.

BCNF might be good enough. 2NF usually isn't.

If you lose this battle, insist on CHECK() constraints to make sure the total is always correct.
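For example, a minimal sketch in T-SQL (the table and column names are invented for illustration): if a redundant stored total must exist, a CHECK() constraint keeps it consistent with the columns it is derived from.

```sql
-- Hypothetical sketch: if a redundant total column must be stored,
-- a CHECK() constraint keeps it consistent with its source columns.
ALTER TABLE enrollment
    ADD CONSTRAINT ck_enrollment_total_units
    CHECK (total_units = lecture_units + lab_units);
```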

People working with us (without enough technical knowledge) cannot build the queries they want from tables that are missing attributes/rows.

Adding some views will help you in the short term. You might need to add some updatable views. But you have the right to insist on a certain level of technical knowledge from people who are going to be working with accounting data in a production-grade enrollment system.
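As a rough sketch of what that can look like (all object names here are hypothetical), a view can present a "wide" picture for less technical users while the base tables stay normalized:

```sql
-- Hypothetical sketch: a convenience view for less technical users;
-- the base tables remain normalized underneath it.
CREATE VIEW student_enrollment_summary AS
SELECT s.student_id,
       s.last_name,
       s.first_name,
       e.term,
       SUM(sub.units) AS total_units
FROM students AS s
JOIN enrollments AS e ON e.student_id = s.student_id
JOIN enrollment_subjects AS es ON es.enrollment_id = e.enrollment_id
JOIN subjects AS sub ON sub.subject_id = es.subject_id
GROUP BY s.student_id, s.last_name, s.first_name, e.term;
```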

Other systems such as accounting, payroll, inventory and purchasing are planned to integrate with the enrollment system. If that is the case, he said, it is good to connect each new system's database to our enrollment system database directly, without going through the queries.

Views (queries) and tables share a single namespace. Client code doesn't say "I want to connect to a table, not a view, and it must be named 'student_payments'." Client code just says, "Connect to 'student_payments.'"

That said, anyone who has permission to insert into a table of payments better know how to insert correctly into a table of payments. If you end up having to include a column that's the result of a calculation on other columns, insist on a CHECK() constraint.
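As a rough illustration of the namespace point (everything except the view name is invented), the integration point can simply be a view carrying the name the other systems expect:

```sql
-- Hypothetical sketch: other systems connect to "student_payments"
-- without knowing, or caring, whether it is a table or a view.
CREATE VIEW student_payments AS
SELECT p.payment_id,
       p.student_id,
       p.payment_date,
       p.amount
FROM payments AS p;
```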

There are systems designed in such a way that all client access is through stored procedures, and client code has no direct access to tables. This approach makes a lot of sense when valid transactions must insert into many tables at once.
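A minimal sketch of that approach, with invented names: the stored procedure is the only write path, and it inserts into several tables inside one transaction.

```sql
-- Hypothetical sketch: clients call this procedure instead of writing
-- to the payment tables directly. Production code would add TRY/CATCH
-- and error handling around the transaction.
CREATE PROCEDURE record_payment
    @student_id INT,
    @amount     DECIMAL(10, 2)
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    INSERT INTO payments (student_id, payment_date, amount)
    VALUES (@student_id, SYSDATETIME(), @amount);

    INSERT INTO ledger_entries (account, debit, credit)
    VALUES ('cash', @amount, 0),
           ('tuition_receivable', 0, @amount);

    COMMIT TRANSACTION;
END;
```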

He argued that all dependent values, such as the computed average grade of each student, must also be included in tables, because what we need, he said, is the physical data, not something to be recomputed by means of views.

What you need is for the database to always give you the right answer.

More importantly, I guess, is that he wanted every transaction to be entered in the database, as in the case of debits and credits for the purpose of balancing transactions.

Finally, something sensible. Financial transactions are generally only inserted. If they're incorrect, they're not updated or deleted. Instead, you insert a compensating transaction. (And, I hope, the reason for it.)
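A hedged sketch of what "insert a compensating transaction" can look like (the table, columns, and values are invented for illustration):

```sql
-- Hypothetical sketch: an incorrect payment is never updated or deleted;
-- a reversing row is inserted instead, with the reason recorded.
INSERT INTO payments (student_id, payment_date, amount, reverses_payment_id, reason)
VALUES (1042, SYSDATETIME(), -500.00, 98765, 'Posted to the wrong student');
```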

As a practical matter, I would not include calculated columns in the first release. I'd add them only if their absence created an actual performance problem.

Having said that, I have a pretty high bar for identifying actual performance problems. If Vinny Vice-president has to wait five seconds for a query to return, that's not an actual performance problem. If a query that takes five seconds is blocking other queries and degrading overall performance every day, that's an actual performance problem.

Don't base your determination of performance problems on the behavior of one single SELECT statement. Your determination of a performance problem should ideally be based on the behavior of the whole system. Practically, it's based on the behavior of a representative sample of SQL statements. Pick a representative selection of SELECT, INSERT, and DELETE statements before you have a performance problem. Test them with representative sample data, and store the timings at the very least. Ideally, store their execution plans and timings.
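One low-tech way to keep such a baseline, sketched here with invented names: time each representative statement and store the result so later runs can be compared against a known starting point.

```sql
-- Hypothetical sketch: record how long a representative query takes
-- so future timings can be compared against a baseline.
DECLARE @started DATETIME2 = SYSDATETIME();

SELECT COUNT(*)            -- stand-in for one representative query
FROM enrollments
WHERE term = '2023-1';

INSERT INTO query_baseline (query_name, run_at, elapsed_ms)
VALUES ('enrollments_by_term', SYSDATETIME(),
        DATEDIFF(MILLISECOND, @started, SYSDATETIME()));
```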

I would not include calculated columns solely for the sake of having "real" data in the table.

If I had to solve an actual performance problem by storing the result of a calculation, I would not release it without first doing at least one of these things.

  • If the constraint required a calculation on a single row, I'd include a CHECK() constraint to guarantee the calculated value is always correct.
  • If the constraint required calculations over multiple rows, I'd include an assertion or trigger to implement the constraint. I'd also carefully review the dbms documentation, looking for instances where, say, triggers might not fire. (On some platforms, triggers don't fire during a bulk load.)
  • If I couldn't use CHECK() constraints, assertions, or triggers, I'd implement some kind of administrative procedure, preferably coded in a stored procedure or its equivalent, to periodically search for data where the actual total didn't match the expected total (a sketch of such a check follows this list). If I couldn't implement that in a stored procedure, I'd do it in application code running under a cron job. There are many ways to do that without materially affecting other processes.
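For instance, a minimal sketch of such a periodic check (all names hypothetical): find rows whose stored total no longer matches the value recomputed from the detail rows.

```sql
-- Hypothetical sketch: flag rows whose stored total disagrees with the
-- recomputed value. Run periodically from a scheduled job.
SELECT e.enrollment_id,
       e.total_units,
       SUM(sub.units) AS recomputed_units
FROM enrollments AS e
JOIN enrollment_subjects AS es ON es.enrollment_id = e.enrollment_id
JOIN subjects AS sub ON sub.subject_id = es.subject_id
GROUP BY e.enrollment_id, e.total_units
HAVING e.total_units <> SUM(sub.units);
```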

Often, I'll implement a periodic administrative procedure to check for missing or miscalculated data even if I also use a declared constraint. Anyone having sufficient privileges can drop or disable a constraint for good reasons, bad reasons, or no reasons at all. (People who have high privileges--including yourself--are your most dangerous users.)

OTHER TIPS

If you are creating a database where all the data can be updated, then normalization is the right approach. You want to be sure that when a data item changes, the change is propagated everywhere. You may not need the esoteric reaches of normalization (for instance, two-character state codes can be fine if you know all the addresses are in the US).

To solve the problem of "queries being deleted" and other issues, use views. These allow you to connect the reporting view of the data to the underlying data structure. After all, what is best for keeping the data consistent may not be best for reporting.

Ultimately, in my experience, you will be heading toward a data mart solution. You will have the underlying data in a normalized form for the operational applications. You'll have another set of tables, derived from these, used for reporting purposes. These tables will be denormalized, redundant, and look different for different groups -- some may be accessed over the web, some through Excel, some may feed other applications (such as budget forecasting). Before you get there, however, views should probably work quite well for meeting query needs.
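A rough sketch of what one such derived reporting table might look like (all names invented); it is rebuilt from the normalized operational tables on a schedule rather than maintained by hand:

```sql
-- Hypothetical sketch: a denormalized reporting table rebuilt
-- periodically from the normalized operational tables.
TRUNCATE TABLE rpt_student_term_summary;

INSERT INTO rpt_student_term_summary (student_id, full_name, term, total_units, average_grade)
SELECT s.student_id,
       s.last_name + ', ' + s.first_name,
       e.term,
       SUM(sub.units),
       AVG(es.grade)
FROM students AS s
JOIN enrollments AS e ON e.student_id = s.student_id
JOIN enrollment_subjects AS es ON es.enrollment_id = e.enrollment_id
JOIN subjects AS sub ON sub.subject_id = es.subject_id
GROUP BY s.student_id, s.last_name, s.first_name, e.term;
```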

Yes, denormalize if you are creating a data warehouse. Instead of normalizing and having hundreds or even thousands of tables, you can denormalize and have fewer tables and fewer joins. It can be better optimized because fewer people will be querying the warehouse.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow