What coding practices are most useful for dealing with trailing whitespace in data fields in binary files?

https://softwareengineering.stackexchange.com/questions/336196

01-01-2021

Question

We have an application that consists of binary files (containing a mix of text and numeric information) and programs written in various languages that create, modify, and read these files. Each text field is stored as a fixed number of bytes at a specified offset in the binary file. (If a text field has fewer characters, the remaining bytes are set to 0.)

There are numerous developers (in different internal organizations) who are involved in maintaining and adding new features to all the programs that constitute this application.

One of the chronic issues we deal with is that string comparisons in code often fail: the stored field may or may not have trailing whitespace, depending on how that specific field was entered and written to the binary file, whereas the string it is compared against in the code does not. (In almost all cases, the trailing whitespace is not considered to be part of the text field value.)

There are different approaches we can take to dealing with this situation:

1. Adopt a convention that all fields stored in a binary file must be trimmed before being written.
2. Adopt a convention that when comparing text, the comparison should be performed with trimmed strings.
3. Do both (1) and (2).

Are there any measurable advantages to suggest that one of the above approaches is better than the others? The overall goal is to reduce the number of application errors caused by trailing whitespace.

Note -- Currently we are doing (2), but we recently found legacy code that was not ignoring the trailing whitespace.

Solution

With multiple teams and developers, you cannot expect everyone to switch to a new convention immediately and flawlessly. And you obviously cannot expect every legacy app to be changed immediately to follow a new convention. So, IMHO, the best approach whenever you develop a new application or change an existing one is to follow the robustness principle:

Be conservative in what you do, be liberal in what you accept from others.

For your case - for every attribute where trailing spaces should be ignored - this clearly means option (3), do both:

(1) is "being conservative in what you do": trim trailing spaces from strings written as output, so other applications taking those strings will work even if they are not prepared.
(2) is what "be liberal in what you accept from others" means. Wherever your application gets those strings from, expect that they may not have been trimmed beforehand.

Of course, doing both (1) and (2) might seem superfluous at first glance, but in bigger systems with multiple components, robustness is a key factor in keeping the system scalable.
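
As a rough illustration of doing both, here is a minimal Python sketch. The 16-byte field width, the ASCII encoding, and the names `encode_field` and `decode_field` are assumptions made for the example, not details taken from the question.

```python
FIELD_WIDTH = 16  # hypothetical fixed field size in bytes


def encode_field(text: str, width: int = FIELD_WIDTH) -> bytes:
    """Conservative writer: trim trailing whitespace, then pad with NUL bytes."""
    raw = text.rstrip().encode("ascii")
    if len(raw) > width:
        raise ValueError(f"value needs {len(raw)} bytes, field holds {width}")
    return raw.ljust(width, b"\x00")


def decode_field(raw: bytes) -> str:
    """Liberal reader: accept NUL padding and untrimmed values alike."""
    return raw.rstrip(b"\x00").decode("ascii").rstrip()


# A legacy writer that left a trailing space before the NUL padding:
legacy = b"WIDGET " + b"\x00" * 9
assert encode_field("WIDGET  ") == b"WIDGET" + b"\x00" * 10
assert decode_field(legacy) == decode_field(encode_field("WIDGET")) == "WIDGET"
```

The reader never relies on the writer having behaved, and the writer never relies on the reader being tolerant, so either side can be migrated independently.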

Other tips

It all depends on the potential relevance of trailing whitespace in the application domain.

Could the trailing whitespace be relevant data?

If your software is for example a graphic editor, then trailing whitespaces could be relevant: the user could enter centered or right-aligned labels and add trailing spaces with intent (e.g. to obtain some alignment effects).

In this case, you should not remove trailing spaces on your own. So the current approach 2 would be the most advisable.

Or are trailing spaces irrelevant?

In many business applications, trailing spaces at the end of a field are not relevant (e.g. they merely reflect the user's input habits).

In this case, I'd advise consistently trimming the trailing spaces with option 1, because these spaces do not really belong to the data. It's not only comparison but also formatting and combination that could be jeopardized by keeping the trailing spaces.

If you do not have sufficient control over all the programs writing the data, you have to prepare yourself to cope with an unreliable world and go for option 3 to make your code resilient. The same applies if, for historical reasons, you cannot update legacy data (for example, if hash codes or signatures attest to its authenticity).
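
For the case where legacy records cannot be rewritten, the comparison can normalize a copy of the value and leave the stored bytes alone. The following is only a sketch: the helper name `field_equals`, the ASCII encoding, and the use of SHA-256 as the "signature" are assumptions for illustration.

```python
import hashlib


def field_equals(stored: bytes, expected: str) -> bool:
    """Compare a raw field to a literal, ignoring NUL padding and trailing whitespace."""
    value = stored.rstrip(b"\x00").decode("ascii").rstrip()
    return value == expected.rstrip()


stored = b"ACME Corp  " + b"\x00" * 5  # untrimmed legacy record
digest_before = hashlib.sha256(stored).hexdigest()

assert field_equals(stored, "ACME Corp")
assert not field_equals(stored, "ACME")
# The stored bytes were never rewritten, so an existing digest still matches.
assert hashlib.sha256(stored).hexdigest() == digest_before
```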

You should:

a) Have some form of abstract interface to read and write files, so that the code for reading and writing the file is contained in one place and not scattered everywhere

b) Have a "relatively formal" document that describes the file format; including specifying a specific character encoding (ASCII? UTF-8?), which characters are legal/illegal (bell? delete? vertical and horizontal tab?) and where they are legal/illegal.

c) Include a "version number" in the binary file; so that if the file format changes the version number changes. This allows you to (at worst) report an error if the file format isn't compatible with the current code; or (at best) allow the code that reads files to support multiple versions of the file format (for "future backward compatibility").

d) Assume that any user can deliberately tamper with the binary file (e.g. using a hex editor or something) or accidentally tamper with the binary file (e.g. trying to save time by modifying it in a text editor); and use defensive coding practices to guard against this; including extensive error checking throughout all code that reads/parses the file, and including detailed human readable error messages for every case where it's possible that the file's actual data (after being tampered with) might not comply with the file format's specification.
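
Points a), c) and d) could be combined roughly as follows. This is a sketch under invented assumptions: the `APPF` signature, the version numbers, the record layout (a 16-byte ASCII name plus a 32-bit quantity), and the rule that control characters are illegal are all hypothetical stand-ins for whatever your format document specifies.

```python
import struct
from typing import Tuple

MAGIC = b"APPF"                     # hypothetical file signature
SUPPORTED_VERSIONS = {1, 2}         # hypothetical version numbers
RECORD = struct.Struct("<4sH16si")  # magic, version, name, quantity


class FormatError(ValueError):
    """Raised when the bytes do not match the documented format."""


def write_record(name: str, quantity: int, version: int = 2) -> bytes:
    raw = name.rstrip().encode("ascii")
    if len(raw) > 16:
        raise FormatError(f"name is {len(raw)} bytes, field holds 16")
    # struct pads the "16s" field with NUL bytes automatically.
    return RECORD.pack(MAGIC, version, raw, quantity)


def read_record(data: bytes) -> Tuple[str, int]:
    if len(data) != RECORD.size:
        raise FormatError(f"expected {RECORD.size} bytes, got {len(data)}")
    magic, version, raw_name, quantity = RECORD.unpack(data)
    if magic != MAGIC:
        raise FormatError(f"bad signature {magic!r}, expected {MAGIC!r}")
    if version not in SUPPORTED_VERSIONS:
        raise FormatError(f"unsupported format version {version}")
    try:
        name = raw_name.rstrip(b"\x00").decode("ascii")
    except UnicodeDecodeError as exc:
        raise FormatError(f"name field is not ASCII: {raw_name!r}") from exc
    if any(ord(ch) < 0x20 for ch in name):
        raise FormatError(f"name field contains a control character: {name!r}")
    return name.rstrip(), quantity


assert read_record(write_record("BOLT  ", 7)) == ("BOLT", 7)
```

Because every program goes through the same two functions, the trimming rule lives in one place, and a tampered or outdated file produces a specific error message instead of a silently failed comparison.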

There are a number of common approaches.

Store the string padded with nulls. (Makes it easy to determine if the string terminates before the end of the field.)
Prefix the string with a binary text length. (Can be space efficient if the next field follows immediately. However, a single-byte length is problematic for strings longer than 255 bytes.)
Space pad the string to the length of the field. (Does not work well if trailing space may be relevant.)

If leading or trailing spaces are not significant then it may be appropriate to trim the string before storage.

It appears you have taken the first option I presented, but trimming spaces before storage has not been reliably applied. It would be appropriate in this case for string comparisons to ignore trailing space.
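
To make the trade-offs between the three layouts concrete, here is a small Python sketch. The 8-byte width, the ASCII encoding, the one-byte length prefix, and the function names are assumptions chosen for the example.

```python
WIDTH = 8  # hypothetical field size in bytes


def nul_padded(text: str) -> bytes:
    return text.encode("ascii").ljust(WIDTH, b"\x00")


def length_prefixed(text: str) -> bytes:
    raw = text.encode("ascii")
    if len(raw) > 255:
        raise ValueError("a one-byte length prefix cannot describe more than 255 bytes")
    return bytes([len(raw)]) + raw


def space_padded(text: str) -> bytes:
    return text.encode("ascii").ljust(WIDTH, b" ")


value = "AB "  # one trailing space that may or may not be significant

assert nul_padded(value) == b"AB " + b"\x00" * 5
assert length_prefixed(value) == b"\x03AB "
assert space_padded(value) == b"AB" + b" " * 6

# NUL padding and a length prefix both round-trip the trailing space ...
assert nul_padded(value).rstrip(b"\x00").decode("ascii") == value
stored = length_prefixed(value)
assert stored[1:1 + stored[0]].decode("ascii") == value
# ... while space padding cannot tell it apart from the padding itself.
assert space_padded(value) == space_padded("AB")
```

With NUL padding, any whitespace found before the first NUL was put there by the writer, which is why untrimmed values survive in the file and comparisons have to ignore them.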

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange