Question

I am about to use an algorithm to encode a variable-length but very long String field retrieved from an XML file; the encoded data should then be persisted in the database.

Later, when I receive a second file, I need to fetch the previously stored encoded data from the database, decode it, and compare it against the new data to check for duplicates.

I tried the org.apache.commons.codec.binary.Base64 class; it has 2 methods:

  1. encodeBase64(byte[] barray)
  2. decodeBase64(String str)

which work perfectly fine and solve my problem. But it converts a 55-char string to just a 6-char String.

So I wonder if there is any case where this algorithm encodes 2 Strings which are very large and differ by only 1 char (for example) into the same encoded byte arrays.

I do not know much about the Base64 class, but if anyone can help me out it will be really helpful.

If you can suggest any other algorithm which turns a large String into a short, fixed-length one and serves my purpose, I will be happy to use it.

Thanks in advance.

Solution

Not very efficient.

Also, using sun.misc classes gives a non-portable application.

Check out the following performance comparisons from MiGBase64:

[Image: Base64 encoder/decoder performance comparison chart from MiGBase64]


So I wonder if there is any case where this algorithm encodes 2 Strings which are very large and differ by only 1 char (for example) into the same encoded byte arrays.

Base64 isn't a hashing algorithm; it's an encoding, and must therefore be bi-directional. Collisions cannot occur by design - otherwise decoding would be non-deterministic. Base64 is designed to represent arbitrary binary data in an ASCII string, and it always produces more characters than the number of input bytes (4 output characters for every 3 input bytes). A Unicode string must first be converted to bytes, and the character set chosen may itself require multiple bytes per character, so the Base64 representation of a Unicode string will vary depending on the encoding (UTF-8, UTF-16) used. For example:

Base64( UTF8( "test" ) ) => "dGVzdA=="
Base64( UTF16( "test" ) ) => "/v8AdABlAHMAdA=="
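These values can be reproduced with a few lines of Java - a minimal sketch, assuming Java 8's java.util.Base64 (Apache Commons Codec's Base64 would give the same result for the same bytes):

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class Base64Demo {
        public static void main(String[] args) {
            // Base64 of the UTF-8 bytes of "test" -> "dGVzdA=="
            System.out.println(Base64.getEncoder()
                    .encodeToString("test".getBytes(StandardCharsets.UTF_8)));

            // Base64 of the UTF-16 bytes (BOM + big-endian) -> "/v8AdABlAHMAdA=="
            System.out.println(Base64.getEncoder()
                    .encodeToString("test".getBytes(StandardCharsets.UTF_16)));
        }
    }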

Solution 1

Use lossless compression

GZip( UTF8( "test" ) )

Here you are converting the string to a byte array and using lossless compression to reduce the number of bytes you have to store. You can vary the char encoding and compression algorithm to reduce the number of bytes depending on the Strings you will be storing (i.e. if it's mostly ASCII then UTF-8 will probably be best).

Pros: no collisions, ability to recover the original string
Cons: bytes required to store the value are variable; bytes required are larger
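A minimal sketch of this approach, assuming the JDK's built-in java.util.zip GZIP streams (the compress/decompress method names below are just illustrative):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipExample {
        // Compress a String to bytes suitable for storing in a BLOB column
        static byte[] compress(String value) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
                gzip.write(value.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        }

        // Recover the original String from the stored bytes
        static String decompress(byte[] stored) throws IOException {
            try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(stored))) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = gzip.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
                return new String(bos.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }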

Solution 2

Use a hashing algorithm

SHA256( UTF8( "test" ) )

Here you are converting the string to a fixed-length set of bytes with a hashing function. Hashing is uni-directional and, by its nature, collisions are possible. However, based on the profile and number of Strings that you expect to process, you can select a hash function to minimise the likelihood of collisions.

Pros: bytes required to store the value are fixed; bytes required are small
Cons: collisions possible, no ability to recover the original string
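A minimal sketch of this approach, assuming the JDK's MessageDigest; the isDuplicate helper is just an illustration of the duplicate check described in the question:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public class HashExample {
        // Produce a fixed-length (32 byte) digest of the String's UTF-8 bytes
        static byte[] sha256(String value) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(value.getBytes(StandardCharsets.UTF_8));
        }

        // Duplicate check: compare the stored digest against the new field's digest
        static boolean isDuplicate(byte[] storedDigest, String newValue) throws NoSuchAlgorithmException {
            return Arrays.equals(storedDigest, sha256(newValue));
        }
    }

Recent versions of Apache Commons Codec (which you already use for Base64) also ship a DigestUtils class that produces the same digest as a hex String in a single call.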

OTHER TIPS

I just saw your comment - it seems you're actually looking for compression rather than hashing as I initially thought. Though in that case, you won't be able to get fixed-length output for arbitrary input (think about it: an infinite number of inputs cannot map one-to-one onto a finite number of outputs), so I hope that wasn't a strong requirement.

In any case, the performance of your chosen compression algorithm will depend on the characteristics of the input text. In the absence of further information, DEFLATE compression (as used by the Zip input streams, IIRC) is a good general-purpose algorithm to start with, and at least use as a basis for comparison. For ease of implementation, though, you can use the Deflater class built into the JDK, which uses ZLib compression.
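For example, a minimal sketch using java.util.zip.Deflater and Inflater (standard JDK classes; the buffer size and compression level here are arbitrary choices):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class DeflateExample {
        // Compress the String's UTF-8 bytes with DEFLATE (zlib format)
        static byte[] deflate(String value) {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(value.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            return out.toByteArray();
        }

        // Restore the original String from the compressed bytes
        static String inflate(byte[] compressed) throws DataFormatException {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
            inflater.end();
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }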

If your input strings have particular patterns, then different compression algorithms may be more or less efficient. In one respect it doesn't matter which one you use, if you don't intend the compressed data to be read by any other processes - so long as you can compress and decompress yourself, it'll be transparent to your clients.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow