Question

Lets say you have 20 options you want an end user to choose from. Based on those choices, calculations will be done on them to create a percentage. What would make better sense in terms of query performance when running query calculations?

a) Build one column and have all the selected values comma separated within that column.
b) Have each option as its own column in the table

c) Have a linking table that marks the choices back to the users ID?

If this should be asked somewhere else, let me know and I will delete this and move it to the appropriate location.

No correct solution

OTHER TIPS

The key here is that you are doing calculations against this data.

There are a variety of reasons why you should not overload a column by storing a CSV of multiple data points in it, but the controlling reason in this case is performance.

If you are basing calculations in the database on the values of these data points, you will absolutely want to utilize separate columns for them. If you store them as a single CSV string, then every calculation will incur the cost of unpacking the CSV and converting the characters to numeric values. The first operation is particularly expensive, in every DBMS I have ever used. It's the sort of thing that makes DBAs curse and throw things at the wall.

If, on the other hand, the data points are already stored as numeric values in independent columns, those overhead costs are not incurred and you can simply do basic math to calculate a numeric result from them.

Take it from a DBA: The cost of storing separate columns (or even a separate CustomerChoices table) is not zero, but it is still far less than the cost incurred by dealing with unpacking overloaded columns and doing data type conversions.

(In addition, when you store CSV data you are storing unnecessary bytes that are not even data: all of the commas. That too has a cost.)

The two prior answers may have invoked query performance in support of the CSV solution, but they have overlooked the key point that the data must be employed in numerical calculations.

If you could do the number crunching in your application and store only the computed percentage and the choices to your DB store, then you could maybe get away with storing the choices as a CSV. But only when a DBA isn't looking.

Even in that case, any DBA worth their salt will argue for separate numeric columns anyway, because at some time in the future there will be a request to the DBA to reproduce the application's calculation in the database for an auditing query, and they will be out of things to throw at the wall because they already smashed all their favorite things.

a) probably violates the principle of atomicity and therefore the 1NF, leading to multitude of problems.

The choice between b) and c) depends on how dynamic the values should be:

  • c) is more appropriate if you want to be able to dynamically "grow" the set of permitted values,
  • while b) may be more straightforward (and performant) for the static case.

--- UPDATE ---

OK, let me touch the query performance aspect a bit.

By using b) or c), you can do AVG, SUM and other aggregate functions directly in the DBMS, then return the result alone. That conserves network bandwidth, which tends to be scarcer resource than CPU.

In case of a), you'd probably implement it by fetching all the "fat" data to the client and doing the calculations there (unpacking and aggregating data on the server would require some acrobatics - not impossible but certainly more fragile and less maintainable). And as Jonathan Van Matre rightfully pointed out, serializing into a string, as opposed to using the native binary storage format, is not exactly the most compact way of representing the data to begin with.

It all depends and I'm assuming we are talking about a sql data store.

If you need to be able to quickly query on things like which users have selected which option then then you'll either want separate columns in your table or another table with, say three columns.

userId
optionId
optionValue

optionId would identify the option ('option1', 'option2', etc.). This approach has the advantage of allowing you to add option types without modifying your database schema. This might be an advantage depending on your environment (like if adding a database column requires a lengthy change control process).

Likely the separate table approach would be somewhat slower, but I don't think this is a "difference that makes a difference". I would almost always use the representation that was the easiest to work with for my application logic,

If you really are interested in only the final percentage and never need to query on the options, then I think a comma separated list of strings is probably fine. a DBA might disagree with that opinion :).

In relation to query performance I would suggest you build a column and have all the selected values comma separated within that column

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top