Reduce number of digits by converting to alphanumeric data

https://softwareengineering.stackexchange.com/questions/396997

01-03-2021

Question

We have an app that receives a web service request, processes it and sends the result to our client through another web service call. The request contains a unique field, a tracking Id, which currently follows the pattern [A-Z][A-Z][-][0-9][0-9][0-9][0-9][0-9][0-9]. A sample makes this easier to explain: in the tracking Id "AB-123456", "AB" is always alphabetic, "123456" is currently numeric, and "-" is the separator. There is a length and format restriction in place on our client's side.

Now we want to change the format to "AB-123456-1" to meet a new business requirement: in some scenarios multiple products map to the same tracking Id, so we need to send multiple web service requests to the client, and because the tracking Id must be unique we added a suffix to it. The client team does not agree with this because it would break the length restriction checks in their database, but they are fine with alphanumeric characters. One solution we proposed is base 16 encoding: convert the number ("123456") to hex ("1E240"), so that AB-123456-1 becomes AB-1E2401. We would like to know whether this is the right approach and whether there is an alternative solution for this type of problem.
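To make the proposal concrete, here is a minimal Python sketch of the conversion described above. The function names are hypothetical, and zero-padding the hex part to exactly 5 digits is an assumption (the question does not say how shorter numbers are handled); a fixed width is what lets the last character be read back as the suffix.

```python
def to_hex_tracking_id(tracking_id: str) -> str:
    """Turn 'AB-123456-1' into 'AB-1E2401' (5 hex digits plus the suffix digit)."""
    prefix, number, suffix = tracking_id.split("-")
    if len(suffix) != 1 or not suffix.isdigit():
        raise ValueError("suffix must be a single decimal digit")
    # 5 hex digits cover 0..0xFFFFF (1,048,575), enough for any 6-digit number.
    hex_number = format(int(number), "05X")
    return f"{prefix}-{hex_number}{suffix}"


def from_hex_tracking_id(encoded: str) -> str:
    """Reverse the conversion: 'AB-1E2401' becomes 'AB-123456-1'."""
    prefix, body = encoded.split("-")
    hex_number, suffix = body[:-1], body[-1]
    return f"{prefix}-{int(hex_number, 16):06d}-{suffix}"


print(to_hex_tracking_id("AB-123456-1"))   # AB-1E2401
print(from_hex_tracking_id("AB-1E2401"))   # AB-123456-1
```

The encoded Id has the same length (9 characters) as the current "AB-123456" format.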

The hex representation stays invisible to the customers. It is an indirect customer transaction and is not rendered in any UI (it is a web service call). Our client uses this field (specifically the numeric part) as a unique field in their database. Because they store it as a string, converting to hex is OK for them, but increasing the field length would mean more work on their side. On our end, we just receive the data, process it and send it to the client. Our client uses - as the separator and 123456 as the unique field in the database. We are fine with limiting the additional suffix to a single digit, as long as it is easy to identify programmatically. The suffix has a meaning: it indicates a separate product that belongs to the same tracking Id (other information, such as the cost of the product, will differ in the request we send them).

Solution

If this really solves the problem, why not? Sometimes a pragmatic solution with some "duct tape" is required to get things done, even if it does not look beautiful from a design perspective.

However, if the other party has already stored these keys using decimal digits in their database, does the new encoding not collide with the existing keys? If you send them a hex number that happens to read "123456", and they have already used this decimal string as a key in their db, doesn't that lead to conflicts when they try to store a different record under exactly this key?
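The collision risk can be shown with a small Python sketch. It assumes the encoding proposed in the question (5 zero-padded hex digits followed by a single suffix digit); the helper names are hypothetical.

```python
import re

# The format the client already stores: two letters, a dash, six decimal digits.
OLD_FORMAT = re.compile(r"[A-Z]{2}-[0-9]{6}")


def encode(prefix: str, number: int, suffix: int) -> str:
    """Proposed encoding: 5 hex digits for the number plus one suffix digit."""
    return f"{prefix}-{number:05X}{suffix}"


def count_all_digit_encodings() -> int:
    """How many 6-digit numbers have a hex form made only of decimal digits."""
    return sum(1 for n in range(1_000_000) if f"{n:05X}".isdigit())


new_id = encode("AB", 74565, 6)              # 74565 == 0x12345
print(new_id)                                # AB-123456
print(bool(OLD_FORMAT.fullmatch(new_id)))    # True: looks like an old key
print(count_all_digit_encodings())           # 100000
```

Under these assumptions, one in ten numbers produces an encoded Id that is indistinguishable from an existing decimal key, so the client would need some way to tell old and new keys apart.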

In case the other party can handle this, go ahead. If they can change the encoding now, they can probably change it later to something which is even more compact like "base 64", in case "base 16" won't be sufficient any more. And maybe that gives them enough time for making their system a little bit more evolvable, so they can handle trivial field length extensions in a more sane way.

OTHER TIPS

This is precisely why storing business defined identifiers as database identifiers becomes problematic. The business changes their rules, and this forces downstream changes in databases.

If you could rewind time, a UUID (or GUID in .NET terminology) would be the ideal identifier for the same data shared between two systems, with the "business Id" kept as a separate field. Clearly you cannot do this right now, but you can go halfway.

You can store two Ids for each record. One is the "business friendly" Id, and the other could be your hex-encoded Id you send to the client service. This works OK if your system isn't using the business friendly Id as foreign key values in your own tables.

This way the client service gets what they want, and you can still keep a more technical friendly Id for integrating services.

I would even go a step further and add a UUID column internally, and reference it as the foreign key, alongside the business friendly Id and the client service friendly Id. This sets you up for a future where the UUID, which is ultimately tied to the business friendly Id, can be passed between services. The business friendly Id should be stored in just one table; the UUID should be the primary key there, the foreign key value elsewhere in your system, and the identifier used by outside systems.
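As a rough illustration of the three-identifier idea, the following Python sketch keeps all three Ids on one record and looks records up by UUID. The class, field and variable names are hypothetical, and the in-memory dictionary merely stands in for the single table that would own the mapping.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TrackingRecord:
    business_id: str   # business friendly Id, e.g. "AB-123456-1"
    client_id: str     # Id sent to the client service, e.g. "AB-1E2401"
    # Technical key: generated once, never derived from business rules.
    internal_id: uuid.UUID = field(default_factory=uuid.uuid4)


# Stand-in for the one table that maps the UUID to the other two Ids.
records_by_uuid: dict[uuid.UUID, TrackingRecord] = {}

record = TrackingRecord(business_id="AB-123456-1", client_id="AB-1E2401")
records_by_uuid[record.internal_id] = record

# Other tables would store only record.internal_id as their foreign key.
print(records_by_uuid[record.internal_id].business_id)   # AB-123456-1
```

If the business format changes again, only `business_id` and `client_id` need to change; references through `internal_id` stay valid.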

Encode the number as base 64 using a fixed size representation and restrict one of the characters to be non-numeric to prevent collisions with existing identifiers. Base 64 uses 'A'-'Z' to represent 0-25, 'a'-'z' to represent 26-51, '0'-'9' to represent 52-61, '+' and '/' to represent 62 and 63.

For example, use 4 characters for the number and 2 for the suffix, with the first character of the suffix in the [A-Za-z] range. That allows 64^4 ~ 16.7 million values for the number and 52 x 64 = 3328 values for the suffix.

123456-1 (why not start the suffix at 0?) becomes the indexes [0][30][9][0]-[0][1] in the base 64 table, since 123456 = 0 x 64^3 + 30 x 64^2 + 9 x 64^1 + 0 x 64^0 and the last 2 values represent the suffix, so the result is AeJAAB.
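The worked example above can be reproduced with a short Python sketch. It treats the number as a positional base 64 value written with the Base64 alphabet, which is different from what Python's `base64` module does (that module encodes bytes). The function names are hypothetical.

```python
import string

# Base64 alphabet: A-Z = 0..25, a-z = 26..51, 0-9 = 52..61, '+' = 62, '/' = 63.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def to_base64_digits(value: int, width: int) -> str:
    """Write value as exactly `width` base 64 digits, most significant first."""
    if not 0 <= value < 64 ** width:
        raise ValueError(f"{value} does not fit in {width} base 64 digits")
    chars = []
    for _ in range(width):
        value, index = divmod(value, 64)
        chars.append(ALPHABET[index])
    return "".join(reversed(chars))


def encode(number: int, suffix: int) -> str:
    """4 characters for the number, 2 for the suffix (first one is a letter)."""
    if not 0 <= suffix < 52 * 64:
        raise ValueError("suffix too large: its first character must be a letter")
    return to_base64_digits(number, 4) + to_base64_digits(suffix, 2)


def decode(encoded: str) -> tuple[int, int]:
    """Split the 6 characters back into (number, suffix)."""
    def value_of(text: str) -> int:
        result = 0
        for char in text:
            result = result * 64 + ALPHABET.index(char)
        return result
    return value_of(encoded[:4]), value_of(encoded[4:])


print(encode(123456, 1))   # AeJAAB
print(decode("AeJAAB"))    # (123456, 1)
```

Because the fifth character is always a letter, the encoded value can never consist of six decimal digits, which is what keeps it from colliding with existing identifiers.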

If you are really sure that you will never have more than 52 values (or 54 if you use the '+' and '/') for the suffix, you can use 5 characters for the number, which gives 64^5 ~ 1074 million possible values.

Unclear elements

It would be useful to clarify the following points:

1. Is the - currently stored in the string?
2. Is the length control on the client side based on the visible string or on the length of the data sent to the backend?
3. Is the unique identifier a concatenation of two values (i.e. could there be both AB-123456 and CD-123456), or is 123456 the real identifier and AB just additional info?
4. Is the additional suffix -1 limited to a single digit?
5. Is the additional suffix an arbitrary extension of the id length (i.e. the separator is just there to facilitate reading), or is there a meaning behind this suffix (e.g. would there be a conceptual relation between AB-123456-1 and AB-123456-2)?

Suggestions

If the answer to (1) is yes, then the easiest approach would be to get rid of the separator:

AB123456   instead of AB-123456
AB1234561  instead of AB-123456-1

If the answer to (2) is the visible string, then you have one more reason to get rid of the separator. But really, the front-end team should be more flexible: in the 80's it was a big deal to make a field longer, but in the 21st century, really?

If the answer to (2) is the backend, then you have a reason to get rid of the separator in the DB, but adapt the display to insert the - where the user expects it.

In other cases, of course, going hexadecimal is a way to compress the central number. But you gain only one char, since you still need 5 hex digits to represent 999999. If you go for base 32 instead of base 16, you could save 2 chars, encoding your number in 4 digits.

Note that you cannot find an encoding scheme that uses fewer than 4 printable characters. With base 95 (using all the printable chars of the ASCII character set as digits), you would still need 4 chars, since 95^3 = 857375 is less than 999999. If non-printable chars were tolerated, you could compress the number into 3 chars (20 bits are needed to encode a 6 decimal digit number, and 3 bytes hold 24 bits).
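The character counts quoted above can be checked with a few lines of Python. This is only a sanity check of the arithmetic; `min_chars` is a hypothetical helper name.

```python
def min_chars(max_value: int, base: int) -> int:
    """Smallest number of base-`base` digits able to represent 0..max_value."""
    count = 1
    capacity = base            # number of distinct values `count` digits can hold
    while capacity <= max_value:
        capacity *= base
        count += 1
    return count


for base in (10, 16, 32, 64, 95, 256):
    print(base, min_chars(999_999, base))
# 10 6
# 16 5
# 32 4
# 64 4
# 95 4
# 256 3
```

Going from base 32 to base 95 does not save another character; only a full byte per character (base 256) brings the number down to 3.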

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange