Efficient compression and representation of key value pairs to be read from 1D barcodes

Question 1

After a lot of playing and fiddling around, we finally chose this approach:

1. Encode settings into a byte stream

Field values are serialized into a byte stream, with a header for each field. The header consumes one byte and contains the ID of the field and some flags that help to reduce the amount of data to transport. Depending on the type of the field (e.g. a string, a number or an IP address), the value is efficiently encoded into the byte stream. For example, an IP address is encoded with 4 bytes, whereas a boolean flag is encoded directly into the field header. This way, we can even encode SSL certificates into the stream, if required. As typical barcode formats cannot transport arbitrary byte values, we need to re-encode the byte stream in the next step.
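To make the field format more concrete, here is a minimal sketch in Python. The bit layout of the header byte (5 bits of field ID, one bit carrying a boolean value), the one-byte length prefix for strings, and all function names are assumptions made for illustration; the layout we actually used is not shown here.

```python
import ipaddress

# Hypothetical header layout (an assumption, not the original format):
#   bits 0-4: field ID (0..31)
#   bit 5:    boolean value, carried in the header itself
ID_MASK = 0x1F
BOOL_FLAG = 0x20


def _header(field_id: int, flags: int = 0) -> int:
    """Build the one-byte field header from an ID and flag bits."""
    if not 0 <= field_id <= ID_MASK:
        raise ValueError("field ID must fit into 5 bits")
    return field_id | flags


def encode_bool(field_id: int, value: bool) -> bytes:
    """A boolean needs no payload: the value lives in the header byte."""
    return bytes([_header(field_id, BOOL_FLAG if value else 0)])


def encode_ipv4(field_id: int, address: str) -> bytes:
    """An IPv4 address is stored as its 4 raw bytes after the header."""
    return bytes([_header(field_id)]) + ipaddress.IPv4Address(address).packed


def encode_string(field_id: int, text: str) -> bytes:
    """A string is stored as header, one length byte, then UTF-8 data."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("string too long for a one-byte length prefix")
    return bytes([_header(field_id), len(data)]) + data


# Three fields: a flag, an IP address and a short string.
stream = (
    encode_bool(1, True)
    + encode_ipv4(2, "192.168.0.10")
    + encode_string(3, "wlan")
)
```

In this sketch the three fields take 12 bytes in total: one for the flag, five for the address and six for the string.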

2. Convert to barcode format

The resulting byte array is now treated as one very large integer and converted into the target barcode format using a base encoding and a charset (see this question). This way, we use the barcode format's full character set to transport our data (in contrast to Base64 or other fixed encodings). From the resulting string, we can chunk off single barcodes and add some additional header information to them (e.g. how many barcodes have to be scanned? is the data encrypted? ...).

When the barcodes get scanned on a mobile device, the encoded string can be restored and converted back into the same big integer. This integer can then be treated as a byte array that can be parsed, provided the field serialization format is known.
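The conversion in both directions can be sketched as follows in Python, whose built-in integers are arbitrary precision. The 43-character charset below is the classic Code 39 set and is only an example. The leading 0x01 sentinel byte is an assumption added here so that leading zero bytes survive the round trip; the function names are hypothetical.

```python
# Classic Code 39 character set (43 symbols), used here only as an example.
CODE39_CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-. $/+%"


def bytes_to_symbols(data: bytes, charset: str = CODE39_CHARSET) -> str:
    """Treat the bytes as one big integer and write it in base len(charset)."""
    # The 0x01 sentinel keeps leading zero bytes from being lost.
    number = int.from_bytes(b"\x01" + data, "big")
    base = len(charset)
    digits = []
    while number:
        number, remainder = divmod(number, base)
        digits.append(charset[remainder])
    return "".join(reversed(digits))


def symbols_to_bytes(text: str, charset: str = CODE39_CHARSET) -> bytes:
    """Rebuild the big integer from the scanned string and return the bytes."""
    base = len(charset)
    number = 0
    for ch in text:
        number = number * base + charset.index(ch)  # ValueError on bad input
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    if not raw or raw[0] != 1:
        raise ValueError("missing sentinel byte, input is not a valid payload")
    return raw[1:]


payload = bytes([0, 0, 0x21, 2, 192, 168, 0, 10])
encoded = bytes_to_symbols(payload)
decoded = symbols_to_bytes(encoded)
```

Splitting `encoded` into several barcodes and adding per-barcode headers is left out of this sketch.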

This approach turned out to be very efficient and fast (we had some concerns regarding the BigInteger implementation on CF).

Question 2

One simple method would be to map all 64 characters directly to Code 128 symbols. This would leave 30-40 Code 128 slots available. In the remaining slots, define some double characters:

== =& 0= 1= 2= 3= 4= 5= 6= 7= 8= 9= &0 &1 &2 &3 &4 &5 &6 &7 &8 &9

(repeat last character)= =(double next character) &(double next character)
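As a rough illustration of the idea, the following Python sketch splits a string into tokens, where each token would occupy one barcode symbol. The greedy left-to-right matching and the function name are assumptions; only the plain two-character pairs listed above are modelled, not the "repeat" and "double" entries.

```python
DIGITS = "0123456789"

# Two-character sequences that each get a spare symbol slot of their own.
PAIR_TOKENS = (
    {"==", "=&"}
    | {d + "=" for d in DIGITS}
    | {"&" + d for d in DIGITS}
)


def tokenize(text: str, pairs=PAIR_TOKENS) -> list:
    """Greedily split text into tokens; each token costs one symbol."""
    tokens = []
    i = 0
    while i < len(text):
        candidate = text[i:i + 2]
        if candidate in pairs:
            tokens.append(candidate)  # one symbol covers two characters
            i += 2
        else:
            tokens.append(text[i])    # ordinary single-character symbol
            i += 1
    return tokens


tokens = tokenize("1=A&2=B")
# 7 characters are covered by 5 symbols: "1=", "A", "&2", "=", "B"
```

How much this saves depends entirely on how often the chosen pairs occur in the real data, so the pair table should be picked from actual samples.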

Question 3

While some barcode formats have a fixed set of characters they can represent and use the same amount of space to hold each character, others either use multiple character sets or use variable amounts of space to hold each character. For example, "classic" Code 39 defines 43 characters, each represented by one of 43 symbols, and simply can't represent any other characters, but there's another Code 39 variant which represents 39 common characters using one symbol each, and other characters using two-symbol sequences. Suppose, for example, one wanted to store a bunch of binary data in a Code 39 barcode. If one converted the data to base-64 format, the four characters associated with three octets of raw data would take an average of about 5.69 symbols to store [about 27 of the 64 characters used in base64 take two symbols to store in Code 39]. If one instead chose 32 characters that can be represented by one symbol each, one could store 24 (or 25) bits using five symbols that hold five bits each [a consistent 1.67 symbols per octet, versus an average of about 1.9 and a worst case of 2.67 for base64]. If one were using "classic" Code 39 (which can represent 43 characters using one symbol each), one could even store four octets in six symbols, since 43^6 exceeds 2^32 [an average of 1.5 symbols per octet].
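The figures above can be checked with a few lines of Python. The function name is made up for this sketch, and the count of 27 two-symbol characters is the approximate value quoted in the text.

```python
def min_symbols(alphabet_size: int, octets: int) -> int:
    """Smallest number of symbols that can represent every value of `octets` bytes."""
    count = 0
    while alphabet_size ** count < 256 ** octets:
        count += 1
    return count


# Base64 over extended Code 39: 4 characters per 3 octets, and roughly
# 27 of the 64 characters cost two symbols instead of one.
base64_avg_symbols = 4 * (1 + 27 / 64)   # symbols per 3 octets, on average
base64_worst_symbols = 4 * 2             # every character costs two symbols

# 32 single-symbol characters: 5 bits per symbol.
base32_symbols = min_symbols(32, 3)      # symbols per 3 octets

# Classic Code 39: 43 single-symbol characters.
base43_symbols = min_symbols(43, 4)      # symbols per 4 octets

densities = {
    "base64 average": base64_avg_symbols / 3,
    "base64 worst case": base64_worst_symbols / 3,
    "32 characters": base32_symbols / 3,
    "43 characters": base43_symbols / 4,
}
for name, value in densities.items():
    print(f"{name}: {value:.2f} symbols per octet")
```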

Different barcode formats are "optimized" for different character sets. Some, like Code 128, have multiple character sets and can be used efficiently with data that uses the full range of one character set while avoiding characters outside it. I don't know of any particular recommended approaches for reformatting data so as to optimize the use of a particular symbology's character sets, but examining the encoding used by a symbology together with your particular requirements should help you figure out what encoding will work best for your application.