encoding/decoding string and a special character to byte array

https://stackoverflow.com/questions/1873971

18-09-2019

Question

I had a requirement to encode a 3-character string (always letters) into a 2-element byte[] array. This was done to save space and for performance reasons.

Now the requirement has changed a bit. The String will be of variable length: either length 3 (as above) or length 4 with one special character at the beginning. The special character is fixed, i.e. if we choose @ it will always be @ and always at the beginning. So if the length of the String is 3, it contains only letters, and if the length is 4, the first character is always '@', followed by 3 letters.

So I can use

charsAsNumbers[0] = (byte) (locationChars[0] - '@');

instead of

charsAsNumbers[0] = (byte) (chars[0] - 'A');

Can I still encode the 3 or 4 chars to 2 byte array and decode them back? If so, how?

Solution

Yes, it is possible to encode an extra bit of information while maintaining the previous encoding for 3 character values. But since your original encoding doesn't leave nice clean swaths of free numbers in the output set, mapping of the additional set of Strings introduced by adding that extra character cannot help but be a little discontinuous.

Accordingly, I think it would be hard to come up with mapping functions that handle these discontinuities without being both awkward and slow. I conclude that a table-based mapping is the only sane solution.

I was too lazy to re-engineer your mapping code, so I incorporated it into my table initialization code; this also eliminates many opportunities for translation errors :) Your encode() method is what I call OldConverter.encode().

I've run a small test program to verify that NewConverter.encode() comes up with the same values as OldConverter.encode(), and is in addition able to encode Strings with a leading 4th character. NewConverter.encode() doesn't care what that character is; it goes by String length. For decode(), the prefix is whatever character the table was built with (I think of it as PREFIX_CHAR; the code as posted hardcodes "@"). I've also checked by eye that the byte array values for prefixed Strings don't duplicate any of those for non-prefixed Strings, and finally that encoded prefixed Strings can indeed be converted back to the same prefixed Strings.

package tequilaguy;


public class NewConverter {

   private static final String[] b2s = new String[0x10000];
   private static final int[] s2b = new int[0x10000];
   static { 
      createb2s();
      creates2b();
   }

   /**
    * Create the "byte to string" conversion table.
    */
   private static void createb2s() {
      // Fill 17576 elements of the array with b -> s equivalents.
      // index is the combined byte value of the old encode fn; 
      // value is the String (3 chars). 
      for (char a='A'; a<='Z'; a++) {
         for (char b='A'; b<='Z'; b++) {
            for (char c='A'; c<='Z'; c++) {
               String str = new String(new char[] { a, b, c});
               byte[] enc = OldConverter.encode(str);
               int index = ((enc[0] & 0xFF) << 8) | (enc[1] & 0xFF);
               b2s[index] = str;
               // int value = 676 * a + 26 * b + c - ((676 + 26 + 1) * 'A'); // 45695;
               // System.out.format("%s : %02X%02X = %04x / %04x %n", str, enc[0], enc[1], index, value);
            }
         }
      }
      // Fill 17576 elements of the array with b -> @s equivalents.
      // index is the next free (= still null) array index;
      // value = the String (@ + 3 chars)
      int freep = 0;
      for (char a='A'; a<='Z'; a++) {
         for (char b='A'; b<='Z'; b++) {
            for (char c='A'; c<='Z'; c++) {
               String str = "@" + new String(new char[] { a, b, c});
               while (b2s[freep] != null) freep++;
               b2s[freep] = str;
               // int value = 676 * a + 26 * b + c - ((676 + 26 + 1) * 'A') + (26 * 26 * 26);
               // System.out.format("%s : %02X%02X = %04x / %04x %n", str, 0, 0, freep, value);
            }
         }
      }
   }

   /**
    * Create the "string to byte" conversion table.
    * Done by inverting the "byte to string" table.
    */
   private static void creates2b() {
      for (int b=0; b<0x10000; b++) {
         String s = b2s[b];
         if (s != null) {
            int sval;
            if (s.length() == 3) {
               sval = 676 * s.charAt(0) + 26 * s.charAt(1) + s.charAt(2) - ((676 + 26 + 1) * 'A');
            } else {
               sval = 676 * s.charAt(1) + 26 * s.charAt(2) + s.charAt(3) - ((676 + 26 + 1) * 'A') + (26 * 26 * 26);
            }
            s2b[sval] = b;
         }
      }
   }

   public static byte[] encode(String str) {
      int sval;
      if (str.length() == 3) {
         sval = 676 * str.charAt(0) + 26 * str.charAt(1) + str.charAt(2) - ((676 + 26 + 1) * 'A');
      } else {
         sval = 676 * str.charAt(1) + 26 * str.charAt(2) + str.charAt(3) - ((676 + 26 + 1) * 'A') + (26 * 26 * 26);
      }
      int bval = s2b[sval];
      return new byte[] { (byte) (bval >> 8), (byte) (bval & 0xFF) };
   }

   public static String decode(byte[] b) {
      int bval = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
      return b2s[bval];
   }

}

I've left a few intricate constant expressions in the code, especially the powers-of-26 stuff. The code looks horribly mysterious otherwise. You can leave those as they are without losing performance, as the compiler folds them up like Kleenexes.

Update:

As the horror of X-mas approaches, I'll be on the road for a while. I hope you'll find this answer and code in time to make good use of it. In support of which effort I'll throw in my little test program. It doesn't directly check stuff but prints out the results of conversions in all significant ways and allows you to check them by eye and hand. I fiddled with my code (small tweaks once I got the basic idea down) until everything looked OK there. You may want to test more mechanically and exhaustively.

package tequilaguy;

public class ConverterHarness {

//   private static void runOldEncoder() {
//      for (char a='A'; a<='Z'; a++) {
//         for (char b='A'; b<='Z'; b++) {
//            for (char c='A'; c<='Z'; c++) {
//               String str = new String(new char[] { a, b, c});
//               byte[] enc = OldConverter.encode(str);
//               System.out.format("%s : %02X%02X%n", str, enc[0], enc[1]);
//            }
//         }
//      }
//   }

   private static void testNewConverter() {
      for (char a='A'; a<='Z'; a++) {
         for (char b='A'; b<='Z'; b++) {
            for (char c='A'; c<='Z'; c++) {
               String str = new String(new char[] { a, b, c});
               byte[] oldEnc = OldConverter.encode(str);
               byte[] newEnc = NewConverter.encode(str);
               byte[] newEnc2 = NewConverter.encode("@" + str);
               System.out.format("%s : %02X%02X %02X%02X %02X%02X %s %s %n", 
                     str, oldEnc[0], oldEnc[1], newEnc[0], newEnc[1], newEnc2[0], newEnc2[1],
                     NewConverter.decode(newEnc), NewConverter.decode(newEnc2));
            }
         }
      }
   }
   public static void main(String[] args) {
      testNewConverter();
   }

}
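
For the more mechanical test suggested above, here is a minimal sketch. It assumes only that the codec maps a String to a 2-byte array and back. RoundTripCheck, countFailures, demoEncode and demoDecode are hypothetical names, not part of the original code, and the demo codec is a simple 5-bits-per-letter stand-in so that the example runs without OldConverter.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

public class RoundTripCheck {

   /**
    * Runs every "XYZ" and "@XYZ" through encode/decode and returns the number
    * of problems found: wrong array length, duplicate code, or failed round trip.
    */
   public static int countFailures(Function<String, byte[]> encode,
                                   Function<byte[], String> decode) {
      Set<Integer> seen = new HashSet<>();
      int failures = 0;
      for (String prefix : new String[] { "", "@" }) {
         for (char a = 'A'; a <= 'Z'; a++) {
            for (char b = 'A'; b <= 'Z'; b++) {
               for (char c = 'A'; c <= 'Z'; c++) {
                  String str = prefix + a + b + c;
                  byte[] enc = encode.apply(str);
                  if (enc == null || enc.length != 2) {
                     failures++;
                     continue;
                  }
                  int code = ((enc[0] & 0xFF) << 8) | (enc[1] & 0xFF);
                  boolean duplicate = !seen.add(code);
                  if (duplicate || !str.equals(decode.apply(enc))) {
                     failures++;
                  }
               }
            }
         }
      }
      return failures;
   }

   // Stand-in codec (an assumption for this demo, NOT the original encoding):
   // 5 bits per letter, top bit set when the string carries the '@' prefix.
   public static byte[] demoEncode(String s) {
      int special = (s.length() == 4) ? 1 : 0;
      int code = (special << 15)
            | ((s.charAt(special) - 'A') << 10)
            | ((s.charAt(special + 1) - 'A') << 5)
            | (s.charAt(special + 2) - 'A');
      return new byte[] { (byte) (code >>> 8), (byte) code };
   }

   public static String demoDecode(byte[] b) {
      int code = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
      String letters = "" + (char) ('A' + ((code >>> 10) & 31))
            + (char) ('A' + ((code >>> 5) & 31))
            + (char) ('A' + (code & 31));
      return (code >= 0x8000) ? "@" + letters : letters;
   }

   public static void main(String[] args) {
      // With the classes from this answer on the classpath, the call would be
      // countFailures(NewConverter::encode, NewConverter::decode).
      System.out.println("failures: "
            + countFailures(RoundTripCheck::demoEncode, RoundTripCheck::demoDecode));
   }
}
```

A result of zero means every one of the 2 * 17576 strings got its own 2-byte code and decoded back to itself.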

OTHER TIPS

Here's how I would do the encoding:

   public static byte[] encode(String s) {
      int code = s.charAt(0) - 'A' + (32 * (s.charAt(1) - 'A' + 32 * (s.charAt(2) - 'A')));
      byte[] encoded = { (byte) ((code >>> 8) & 255), (byte) (code & 255) };
      return encoded;
   }

The first line uses Horner's scheme to arithmetically assemble 5 bits of each character into an integer. It will fail horribly if any of your input chars fall outside the range [A-`], i.e. 'A' through '`' (ASCII 65 to 96), which are the 32 values that fit in 5 bits.

The second line assembles a 2-byte array from the high and low bytes of the resulting 15-bit code.

Decoding could be done in a similar manner, with the steps reversed.
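
As a minimal sketch of that reversal for the 3-character case: the class name HornerCodec and the decode() method are illustrative additions, and encode() is the method shown above. Since encode() puts the first character in the lowest 5 bits, decode() peels characters off from the low end in the same order.

```java
public class HornerCodec {

   // Same encoding as above: 5 bits per character, first character lowest.
   public static byte[] encode(String s) {
      int code = s.charAt(0) - 'A' + (32 * (s.charAt(1) - 'A' + 32 * (s.charAt(2) - 'A')));
      return new byte[] { (byte) ((code >>> 8) & 255), (byte) (code & 255) };
   }

   // Reverse the steps: rebuild the integer, then extract 5 bits at a time.
   public static String decode(byte[] b) {
      int code = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
      char[] chars = new char[3];
      for (int i = 0; i < 3; i++) {
         chars[i] = (char) ('A' + (code & 31)); // lowest 5 bits = next character
         code >>>= 5;
      }
      return new String(chars);
   }

   public static void main(String[] args) {
      byte[] enc = encode("CAB");
      System.out.format("%02X%02X -> %s%n", enc[0], enc[1], decode(enc));
   }
}
```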

UPDATE with the code (putting my foot where my mouth is, or something like that):

public class TequilaGuy {

   public static final char SPECIAL_CHAR = '@';

   public static byte[] encode(String s) {
      int special = (s.length() == 4) ? 1 : 0;
      int code = s.charAt(2 + special) - 'A' + (32 * (s.charAt(1 + special) - 'A' + 32 * (s.charAt(0 + special) - 'A' + 32 * special)));
      byte[] encoded = { (byte) ((code >>> 8) & 255), (byte) (code & 255) };
      return encoded;
   }

   public static String decode(byte[] b) {
      int code = 256 * ((b[0] < 0) ? (b[0] + 256) : b[0]) + ((b[1] < 0) ? (b[1] + 256) : b[1]);
      int special = (code >= 0x8000) ? 1 : 0;
      char[] chrs = { SPECIAL_CHAR, '\0', '\0', '\0' };
      for (int ptr=3; ptr>0; ptr--) {
         chrs[ptr] = (char) ('A' + (code & 31));
         code >>>= 5;
      }
      return (special == 1) ? String.valueOf(chrs) : String.valueOf(chrs, 1, 3);
   }

   public static void testEncode() {
      for (int spcl=0; spcl<2; spcl++) {
         for (char c1='A'; c1<='Z'; c1++) {
            for (char c2='A'; c2<='Z'; c2++) {
               for (char c3='A'; c3<='Z'; c3++) {
                  String s = ((spcl == 0) ? "" : String.valueOf(SPECIAL_CHAR)) + c1 + c2 + c3;
                  byte[] cod = encode(s);
                  String dec = decode(cod);
                  System.out.format("%4s : %02X%02X : %s\n", s, cod[0], cod[1], dec);
               }
            }
         }
      }
   }

   public static void main(String[] args) {
      testEncode();
   }

}

In your alphabet, you use only 15 of the 16 available bits of the output. So you could just set the MSB (most significant bit) if the string is of length 4 since the special char is fixed.
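
A minimal sketch of the MSB idea, written as a wrapper around an existing 3-letter codec. It assumes the existing encoder never sets the top bit of the first byte, as claimed above. MsbFlagCodec, encode3 and decode3 are hypothetical names, and encode3/decode3 are a 5-bits-per-letter stand-in for your real methods.

```java
public class MsbFlagCodec {

   static final char SPECIAL_CHAR = '@';

   // Stand-in for the existing 3-letter encoder (assumption: top bit stays 0).
   static byte[] encode3(String s) {
      int code = ((s.charAt(0) - 'A') << 10) | ((s.charAt(1) - 'A') << 5) | (s.charAt(2) - 'A');
      return new byte[] { (byte) (code >>> 8), (byte) code };
   }

   // Stand-in for the existing 3-letter decoder.
   static String decode3(byte[] b) {
      int code = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
      char[] chars = {
         (char) ('A' + ((code >>> 10) & 31)),
         (char) ('A' + ((code >>> 5) & 31)),
         (char) ('A' + (code & 31))
      };
      return new String(chars);
   }

   public static byte[] encode(String s) {
      boolean prefixed = (s.length() == 4);
      byte[] enc = encode3(prefixed ? s.substring(1) : s);
      if (prefixed) {
         enc[0] |= (byte) 0x80; // set the MSB to record the '@' prefix
      }
      return enc;
   }

   public static String decode(byte[] b) {
      boolean prefixed = (b[0] & 0x80) != 0;
      byte[] plain = { (byte) (b[0] & 0x7F), b[1] }; // clear the flag bit
      String letters = decode3(plain);
      return prefixed ? SPECIAL_CHAR + letters : letters;
   }

   public static void main(String[] args) {
      for (String s : new String[] { "ABC", "@ABC" }) {
         byte[] enc = encode(s);
         System.out.format("%4s : %02X%02X : %s%n", s, enc[0], enc[1], decode(enc));
      }
   }
}
```

The wrapper leaves the codes for 3-letter strings untouched, so existing encoded data stays valid as long as the top-bit assumption holds.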

The other option is to use a translation table. Just create a String with all valid characters:

String valid = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ";

The index of a character in this string is the encoding in the output. Now create two arrays:

byte encode[] = new byte[256];
char decode[] = new char[valid.length()];
for (int i=0; i<valid.length(); i++) {
    char c = valid.charAt(i);
    encode[c] = (byte) i;   // explicit cast: int does not narrow to byte implicitly
    decode[i] = c;
}

Now you can look up the values for each direction in the arrays and add any character you like in any order.
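
A minimal, self-contained sketch of those lookups. The class and method names (TranslationTable, toCodes, fromCodes) are illustrative only. Note that this produces one small code per character; packing the codes into 2 bytes is a separate step that is not shown here.

```java
public class TranslationTable {

   static final String VALID = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ";
   static final byte[] ENCODE = new byte[256];
   static final char[] DECODE = new char[VALID.length()];

   static {
      for (int i = 0; i < VALID.length(); i++) {
         char c = VALID.charAt(i);
         ENCODE[c] = (byte) i; // character -> index in VALID
         DECODE[i] = c;        // index in VALID -> character
      }
   }

   // One code per character. Characters not in VALID silently map to 0
   // (the same code as '@'), so validate input first if that matters.
   public static byte[] toCodes(String s) {
      byte[] out = new byte[s.length()];
      for (int i = 0; i < s.length(); i++) {
         out[i] = ENCODE[s.charAt(i)];
      }
      return out;
   }

   public static String fromCodes(byte[] codes) {
      char[] out = new char[codes.length];
      for (int i = 0; i < codes.length; i++) {
         out[i] = DECODE[codes[i]];
      }
      return new String(out);
   }

   public static void main(String[] args) {
      byte[] codes = toCodes("@ABZ");
      System.out.println(java.util.Arrays.toString(codes) + " -> " + fromCodes(codes));
   }
}
```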

You would find this a lot easier if you just used the java.nio.charset.CharsetEncoder class to convert your characters to bytes. It would even work for characters other than ASCII. Even String.getBytes would be a lot less code to the same basic effect.

If the "special char" is fixed and you're always aware that a 4 character String begins with this special char, then the char itself provides no useful information.

If the String is 3 characters in length, then do what you did before; if it's 4 characters, run the old algorithm on the String's substring starting with the 2nd character.

Am I thinking too simply or are you thinking too hard?

Licensed under: CC-BY-SA with attribution
