Question

Let's say we have a generic table like below:

id   name       price   quantity
20   product_x  5,00    100
20   product_y  5,00    100
20   product_z  5,00    100
20   product_a  5,00    100

For the name field we more than likely have repeating string values. Grouping and comparing on especially large datasets with repeating strings like this feels like it would be less expensive if all the names were dictionary coded into, let's say, an int16. If that were the case, this practice would be common, but it is not. What are the reasons for that? What am I missing?
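
To illustrate the idea, here is a rough sketch of that kind of dictionary coding in plain Python (the names and values are just placeholders):

    import numpy as np

    names = ["product_x", "product_y", "product_x", "product_z", "product_y"]

    # Build the dictionary: every distinct name gets one small integer code.
    lookup = sorted(set(names))                            # code -> name
    code_of = {name: i for i, name in enumerate(lookup)}   # name -> code

    # The column itself would then hold only int16 codes instead of strings.
    codes = np.array([code_of[n] for n in names], dtype=np.int16)

    print(codes)             # [0 1 0 2 1]
    print(lookup[codes[3]])  # decoding a single value -> product_z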

Solution

Let us start with your "products" as a representative example: OLAP databases which involve products are typically used to answer queries like

  • which are the top 10 selling products over the last 3 months?

  • which 10 products brought us the most revenue over the last year?

  • in which locations did we have the most shortages in delivery of certain products?

All these queries are expected to display product names to the user / analyst. Those users certainly cannot deal directly with things like product IDs or hash codes.

Another important thing about OLAP cubes is that they are typically created from a frozen snapshot of an OLTP database, which means product names don't change during the lifetime of the OLAP data set. So there is typically no requirement to maintain the product names in a central place and to keep them in a normalized fashion. When product names change in the OLTP database in the meantime, the OLAP cube will pick them up the next time a new snapshot is taken, which means the whole dataset is rebuilt from scratch.

So the simplest design of an OLAP table for typical use cases will often be to use the names of the objects directly as keys. Introducing an additional Product table with a product ID and referencing that ID is a more complex design which will probably not lead to speed improvements or simpler queries.
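
For illustration, here is roughly how the first of those queries looks against such a simple table, sketched in Python/pandas with made-up numbers; the result already contains readable product names, with no extra lookup involved:

    import pandas as pd

    # Made-up fact table: one row per sale, product name stored directly.
    sales = pd.DataFrame({
        "name":     ["product_x", "product_y", "product_x", "product_z"],
        "quantity": [100, 40, 60, 25],
        "price":    [5.00, 5.00, 5.00, 5.00],
    })

    # "Top 10 selling products": group directly on the name column.
    top = (sales.groupby("name")["quantity"]
                .sum()
                .sort_values(ascending=False)
                .head(10))
    print(top)  # the index is the product name itself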

Encoding the names the way you suggested, using an in-memory dictionary, can be seen as an optimization: either for space (which is often cheap today) or maybe for speed (where the benefit is often misjudged because people forget to measure).
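
If you want to judge the space side, measuring it is cheap; a small sketch with pandas (the data is made up and the numbers will vary with your own):

    import pandas as pd

    # One million rows drawn from only a handful of distinct names.
    names = pd.Series(["product_x", "product_y", "product_z", "product_a"] * 250_000)

    plain   = names.memory_usage(deep=True)                      # raw strings
    encoded = names.astype("category").memory_usage(deep=True)   # dictionary-encoded

    print(f"plain: {plain / 1e6:.1f} MB, encoded: {encoded / 1e6:.1f} MB")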

But like any optimization, it comes at a cost: the system becomes more complicated, because if strings are encoded first, useful queries will then require an additional decoding step to present the results in a form the users can handle.
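
To make that extra step concrete, here is the same "top products" query sketched with names stored as codes (again with made-up data):

    import pandas as pd

    lookup = ["product_x", "product_y", "product_z"]            # code -> name
    sales = pd.DataFrame({"name_code": [0, 1, 0, 2],            # encoded names
                          "quantity":  [100, 40, 60, 25]})

    top_codes = sales.groupby("name_code")["quantity"].sum().nlargest(10)

    # Extra step: decode the codes back to names before showing them to anyone.
    top_named = top_codes.rename(index=lambda code: lookup[code])
    print(top_named)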

Moreover, such optimizations are often not worth the hassle, because

  • there is simply no optimization required (because the simple design is quick enough as it is, for the given requirements)

  • after implementing and measuring, they turn out not to yield any significant savings (or even make things worse)

  • the underlying database technology already performs such optimizations "under the hood" (like de-duplicating strings); see the sketch after this list
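
As an example of the last point, assuming the OLAP data ends up in Parquet files written with pyarrow: the Parquet writer dictionary-encodes string columns by default, without the schema or the queries having to change:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "name":     ["product_x", "product_y", "product_x", "product_z"],
        "quantity": [100, 40, 60, 25],
    })

    # use_dictionary=True is the default: repeated strings are stored once
    # inside the file, while readers still see ordinary string values.
    pq.write_table(table, "sales.parquet", use_dictionary=True)

    print(pq.read_table("sales.parquet").column("name"))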

So I suggest you start thinking about encoding strings as 16-bit IDs only

  • when you have a real performance or storage problem to solve

  • when you can make an educated guess (or, better, some measurements) that the duplication of those strings is the root cause of the problem, and that deduplicating could help.

Don't optimize "just in case" or because you "feel" something could be less expensive; such feelings tend to mislead people.

Licensed under: CC-BY-SA with attribution