Question

So I don't know how to formulate this clearly, as it's quite confusing. Hence the reason for asking this question, to overcome the confusion.

I will do my best to make it clear, but part of the reason I ask is that I don't know how to make it clear (I've been thinking about this on and off for a while):

If you store information in 8-bit (1-byte) chunks, with a special "null terminator" byte, then you have 255 possible values per "chunk" (per byte).

If you instead store information in 2-bit chunks (as small as possible), with a special "null terminator" 2-bit chunk (let's say it is 00), then you have 3 possible values per "chunk".

So in the 8-bit scenario, if you have 4 chunks, you have 255 * 4 = 1020 possible combinations, in 32 bits.

In the 2-bit scenario, to get to 1020 combinations, you need 340 chunks, which is 680 bits. So storing the same amount of information in the 2-bit scenario requires roughly 21 times the space of the 8-bit scenario.

Meanwhile, if you had a 10-bit scenario, then you could store slightly more than 1020 combinations in only 20 bits (one 10-bit chunk, plus a 10-bit null terminator). So this would be better than 8-bit chunks for 1020 combinations' worth of data.

So, for n-bit chunks, the total size in bits goes (1020/((2^n)-1) chunks of n bits each):

  • 2 ~ 680
  • 3 ~ 438
  • 4 ~ 272
  • 5 ~ 165
  • 6 ~ 98
  • 7 ~ 57
  • 8 ~ 32
  • 9 ~ 18
  • 10 ~ 10

using:

function x(n) {
  // total bits: 1020/((2^n)-1) chunks, each n bits wide, rounded up
  return Math.ceil((1020/(Math.pow(2,n)-1))*n)
}


Or at a larger scale, reaching for 100000 combinations using that equation:

  • 8 ~ 3138
  • 9 ~ 1762
  • 10 ~ 978
  • 11 ~ 538
  • 12 ~ 294
  • 13 ~ 159
  • 14 ~ 86
  • 15 ~ 46
  • 16 ~ 25
  • 17 ~ 13

using:

function x(n) {
  // same as above, but for 100000 combinations
  return Math.ceil((100000/(Math.pow(2,n)-1))*n)
}


It looks like the required size drops roughly exponentially as the chunk size grows.

The question is, how do you determine the best representation for your data? It seems like it depends on the data length. So if you have 1 bit of information, then 2-bit chunks are plenty. But if you have 1 TB of data, then what? Is the right approach to minimize the overall file size (remembering the need for the null terminator)?

I don't have any idea how to get this math right; it's causing me to go around in circles. What I'd like to know is whether there is some equation to determine what the "chunk" size should be for a given amount of data, so that the null terminator is the same size as each of these chunks, yet the overall size of the data is minimal.

Was it helpful?

Solution

Firstly, your maths is incorrect. Let's fix that:

So in the 8-bit scenario, if you have 4 chunks, you have 255 * 4 = 1020 possible combinations, in 32 bits.

You want to count every combination of four bytes, which is 255^4 = roughly four billion combinations, which is strikingly close to 2^32, which is what you'd expect. In fact, 2^32 - 255^4 = 66,716,671, so you've spent sixty-six million possible combinations to reserve one magic value in each of the four bytes.
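
A quick check in the same JavaScript style as the question (all of these values stay well within Number.MAX_SAFE_INTEGER):

// combinations of four bytes when one value per byte is reserved as a terminator
var usable = Math.pow(255, 4);   // 4,228,250,625
// combinations of four unrestricted bytes
var raw = Math.pow(2, 32);       // 4,294,967,296
console.log(raw - usable);       // 66716671 combinations spent on the magic value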

In the 2-bit scenario, to get to [the same number of] combinations, you need

We're actually looking for log_3(255^4), which is just over 20.
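
As a small sketch, using the change-of-base rule (log_3 x = ln x / ln 3) in JavaScript:

// 3-valued chunks (2-bit chunks with 00 reserved) needed to cover 255^4 combinations
var chunks = 4 * Math.log(255) / Math.log(3);
console.log(chunks);   // ≈ 20.18, so 21 data chunks (42 bits) before adding the 2-bit terminator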

The question is, how do you determine what is the best representation of your data?

Well, firstly we decide whether we need a null terminator at all. If we know the whole range of a number will fit in a fixed-width integer type, we can just use that. It's faster and simpler to process, and often absolutely fine.
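
For example, here is a minimal sketch of a fixed-width field that needs no terminator because its range is known in advance (it uses the standard DataView API; the stored value is just an illustration):

// a count known to fit in 32 bits: a fixed 4-byte field, no sentinel value reserved
var buf = new ArrayBuffer(4);
new DataView(buf).setUint32(0, 1020);
console.log(new DataView(buf).getUint32(0));   // 1020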

Secondly, remember we can often just use external compression. Instead of having one complex module with difficult-to-decode variable-length data, you use two simpler modules: one that acts on simple data, and one that (de)compresses arbitrary streams or blocks of data. This may give better compression than trying to micro-optimize the storage of each field, and be less effort.
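
As a rough illustration of that split, here is a Node.js sketch (it assumes Node's built-in zlib module, and the record layout is made up):

var zlib = require('zlib');

// module 1: a simple, easy-to-decode layout
var plain = Buffer.from(JSON.stringify({ values: [1020, 100000, 255] }));

// module 2: a general-purpose compressor applied to the whole block
var packed = zlib.deflateSync(plain);
var restored = JSON.parse(zlib.inflateSync(packed).toString());

console.log(plain.length, packed.length, restored.values);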

If you must use variable-length data, and you really have no way of guessing whether each field will be one byte or one terabyte, then yes, it's difficult to come up with a single solution which is optimal in all cases. But realistically this is almost unheard of.

Other tips

You're apparently working on the assumption that zero-terminating is the only (or maybe just the natural) way of representing variable-length data. This just isn't the case; it happens to be the method chosen for strings in the C language (and maybe its predecessor).

Other possible representations work with a count of the number of elements, which means constant overhead but an upper limit on the number of elements, or with a tree data structure, which incurs higher overhead per element but enables operations that are not easily possible with arrays.
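
Here is a small JavaScript sketch of the count-prefixed alternative (the helper names are made up, and the count is limited to one byte for simplicity):

// [count][elements...] instead of [elements...][terminator]
function encodeWithCount(bytes) {
  if (bytes.length > 255) throw new RangeError('a one-byte count prefix limits this to 255 elements');
  return Uint8Array.from([bytes.length].concat(bytes));   // constant 1-byte overhead, hard upper limit
}

function decodeWithCount(buf) {
  return buf.slice(1, 1 + buf[0]);                        // no scan for a sentinel value needed
}

console.log(Array.from(decodeWithCount(encodeWithCount([65, 66, 67]))));   // [ 65, 66, 67 ]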

Which data representation to choose in memory depends heavily on the operations and algorithms that need to be supported. Space optimization is only necessary for compressed storage and transmission, where information density plays a significant role, or for algorithms that work on huge amounts of data that would cause paging if the data were represented inefficiently.

Generally, micro-optimization of memory consumption doesn't make sense. For a deeper understanding of the possibilities and limits of information storage, having a look at https://en.m.wikipedia.org/wiki/Information_theory is probably a good idea.

Licensed under: CC-BY-SA with attribution