Compression algorithms for numbers only

https://stackoverflow.com/questions/879069

22-08-2019
|

Question

I am to compress location data (latitude,longitude, date,time). All the numbers are in fixed format. 2 of them (latitude,longitude) are with decimal format. Other 2 are integers.

Now these numbers are in fixed format string.

What are the algorithms for compressing numbers in fixed format? Is number only compressions (if there any) better than string compression? Should I directly compress string without converting it to numbers and then compress?

Thanks in advance.

Solution

This is one of these places where a little theory is helpful. You need to think about several things:

what is the resolution of your measurements: 0.1° or 0.001°? 1 second or one microsecond?
are the measurements associated and in some order, or tossed together randomly?

Let's say, just for example, that the resolution is 0.01°. Them you know that your values range from -180° to +180°, or 35900 different values. Lg(35900) ≈ 16 so you need 16 bits; 14 bits for -90°–+90°. Clearly, if you're storing this kind of value as floating-point, you can compress the data by half immediately.

Similarly with date time, what's the range; how many bits must you have?

Now, if the data is in some order (like, samples taken sequentially aboard a single ship) then all you need is a start value and a delta; that can make a big difference. With a ship traveling at 30 knots, the position can't change any more that about 0.03 degrees an hour or about 0.0000083 degrees a second. Those deltas are going to be very small values, so you can store them in a very few bits.

The point is that there are a number of things you can do, but you have to know more about the data than we do to make a recommendation.

Update: Oh, wait, fixed point strings?!

Okay, this is (relatively) easy. Just to start with, yes, you want to convert your strings into some binary representation. Just making up a data item, you might have

040.00105.0020090518212100Z

which you could convert to

|  4000            | short int, 16 bits |  
| 10500            | short int, 16 bits |  
| 20090518212100Z  | 64 bits            |

So that's 96 bits, 12 bytes versus 26 bytes.

OTHER TIPS

Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve will be higher, since fewer bits are used to store the bytes which appear more frequently (in Huffman compression).

Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually space and retrieval efficient.

I recommend you have a look at The Data Compression Book

What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things can affect how well it compresses, and perhaps allow you to convert the data in to something more easily compressed, or simply smaller right out the gate.

Data compress works poorly on "random" data. If your data is within a smaller range, you may well be able to leverage that.

In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.

An example is say that your data is not just Lat's and Long's, but they're assumed to be "close" to each other. Then you could probably store an "origin" Lat and Long, and the rest can be differential. Perhaps these differences are small enough to fit in to a single, signed byte.

That's just a simple example of things you can do with knowledge of the data vs what some generic algorithm may not be able to figure out.

It depends on what you are going to do with the data, and how much precision you need.

Lat/long is traditionally given in degrees, minutes, and seconds, with 60 seconds to the minute, 60 minutes to the degree,and 1 degree of latitude nominally equal to 60 nautical miles (nmi). 1 minute is then 1 nmi, and 1 second is just over 100 ft.

Latitude goes from -90 to +90 degrees. Representing latitude as integer seconds gives you a range of -324000..+324000, or about 20 bits. Longitude goes -180 to +180, so representing longitude the same way requires 1 more bit.

So you can represent a complete lat/long position, to +/- 50 ft, in 41 bits.

Obviously, if you don't need that much precision, you can back down your bit count.

Observe that a traditional single-precision 32-bit float uses about 24 bits of mantissa, so you are down to about +/- 6 feet if you just convert your lat/long in seconds to float. It is kind of hard to beat two single-precision floats for this kind of thing.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow