Best way to export binary data with additional text attributes to a stream

https://softwareengineering.stackexchange.com/questions/273064

07-10-2020

Question

I would like to create a command line API in a Java application to export a binary blob to stdout. I would additionally like to export certain attributes about the blob, ideally in a non-binary format. The goal is to output this data in a manner that is convenient and efficient to write and read. I am fairly set on using stdout because it allows me to easily pipe the data into another process without creating intermediate files, which is a huge convenience.

I could write the text attributes on one tab-separated line and follow them by the binary blob. But since stdout is not structured, it would be awkward to parse if text and binary data are mixed. For example, the stream would probably have to be read in binary mode until a byte equalling an ASCII newline character is found and then the data would have to be split at that point.

Alternately I could wrap all the data in a structured format. The formats that I am most familiar with, like JSON and XML, are text formats and thus the binary data would have to be encoded, which would greatly increase computation time and data size. But there may be formats designed for this purpose.

Thirdly, I could implement a binary format similar to common file formats with a header containing fields that either have fixed byte length or are preceded by byte lengths. However, I am concerned that such a format would be unnecessarily non-standard and annoying to write and parse. In particular, I don't see the benefit of such a format over the first option where a newline separates the attributes and data.

There may be other options as well but I'm not thinking of any good ones. I would very much appreciate others' suggestions on this matter.

Solution

If you just need a simple solution, and you can tolerate non-ASCII bytes in the header region, just use Protocol Buffers.

Link to Protocol Buffers

It is safe to use JSON for the human-readable header of your file, followed by the raw binary blob without encoding or encapsulation, provided the following preconditions hold (see the JSON specification):

Your JSON needs to be an object at the top level.
- This means your entire human readable header will be delimited by an opening brace, and a closing brace.
- Everything in between is properly quoted, escaped, and balanced.
You need to be specific about newlines.
- Newlines that occur inside the JSON stream can be handled liberally.
- However, you must be strict about whether a newline (or any other character or byte sequence) is allowed to occur between
  - The end of the JSON stream, and
  - The beginning of the binary blob.
- Remember that newlines have different representations on different OSes.
- Also, be explicit about the handling of consecutive newlines. These can result from common programming mistakes, which may leave you no choice but to tolerate and work around them.
Your first stage of parsing needs to be done by a JSON reader that detects the end of the top-level object and ignores the unencapsulated binary data that follows, as explained above.
- This first-stage parser does not need to handle the metadata; it just needs to correctly identify the end of the header.
- Once the header is separated, it can be passed into a second JSON reader which loads the metadata into objects.
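
The first-stage reader described above can be sketched as a small byte scanner. This is only an illustration, not a full JSON parser: the class and method names are hypothetical, and it assumes the header is UTF-8 (or ASCII) encoded, so that the bytes for braces, quotes, and backslashes never occur inside a multi-byte character.

```java
/** Hypothetical first-stage reader: finds where the top-level JSON object ends. */
final class JsonHeaderScanner {

    /**
     * Returns the index just past the '}' that closes the top-level object,
     * or -1 if the object is not complete within the buffer.
     * Assumption: the header is UTF-8 (or ASCII) encoded JSON.
     */
    static int findHeaderEnd(byte[] data) {
        int depth = 0;             // current brace nesting depth
        boolean started = false;   // true once the opening '{' has been seen
        boolean inString = false;  // true while inside a quoted JSON string
        boolean escaped = false;   // true if the previous string byte was a backslash

        for (int i = 0; i < data.length; i++) {
            byte b = data[i];
            if (inString) {
                // Inside a string, braces are plain text; only track escapes and the closing quote.
                if (escaped) {
                    escaped = false;
                } else if (b == '\\') {
                    escaped = true;
                } else if (b == '"') {
                    inString = false;
                }
                continue;
            }
            if (b == '"') {
                inString = true;
            } else if (b == '{') {
                depth++;
                started = true;
            } else if (b == '}') {
                depth--;
                if (started && depth == 0) {
                    return i + 1; // header ends here; everything after is not JSON
                }
            }
        }
        return -1;
    }
}
```

The bytes from `0` up to the returned index can then be handed to a regular JSON library for the second stage. Whatever separator bytes (if any) you allow between that index and the blob still need to be skipped according to your own newline rules.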

If you are concerned about the need for software to extract the binary blob without having to rely on a correctly implemented JSON seeker, you may try this simple hack:

Make the requirement that the first field value of the top-level JSON object is a decimal number that is the number of bytes of the binary blob.

Your non-JSON-aware reader can then quickly delimit the binary blob as follows:

Open the stream in binary mode.
Discard all bytes not in the ASCII range of decimal digits, '0' - '9', until the first decimal digit is found.
Parse the decimal digits until a non-digit is found.
Seek to the end of file, and then recalculate the starting offset of the binary blob by subtracting its expected length (detected above) from its end position.
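
A minimal sketch of these four steps, assuming the whole file is already in a byte array (with a real file you would use something seekable such as `RandomAccessFile`; a pipe cannot be seeked, so this variant does not apply to data read from stdin). The class and method names are hypothetical, and the sketch assumes no digit appears before the length value, for example in the name of the first field.

```java
/** Hypothetical non-JSON-aware reader for the "first number is the blob length" convention. */
final class BlobLocator {

    /**
     * Skips bytes until the first ASCII digit, then parses the run of digits.
     * Returns -1 if the data contains no digit at all.
     * Note: no overflow check; a sketch only.
     */
    static long readFirstDecimal(byte[] data) {
        int i = 0;
        // Discard everything that is not '0'..'9'.
        while (i < data.length && (data[i] < '0' || data[i] > '9')) {
            i++;
        }
        if (i == data.length) {
            return -1;
        }
        long value = 0;
        // Accumulate digits until the first non-digit.
        while (i < data.length && data[i] >= '0' && data[i] <= '9') {
            value = value * 10 + (data[i] - '0');
            i++;
        }
        return value;
    }

    /** Computes the blob's starting offset by counting back from the end of the data. */
    static long blobStart(byte[] data) {
        long length = readFirstDecimal(data);
        if (length < 0 || length > data.length) {
            throw new IllegalArgumentException("missing or implausible blob length");
        }
        return data.length - length;
    }
}
```

Because the offset is computed from the end, any separator bytes between the header and the blob do not matter to this reader.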

If you need the whole transmission to be sequentially parseable (streamable), you can do this instead:

Encode the metadata in JSON in a memory stream.
- Do not write it to the output yet; that will be several steps later.
- Be specific about the output encoding.
- From the next step and on, we will treat it as a byte stream, which means it cannot be "edited" after this step.
Get the number of bytes of the encoded JSON byte stream.
Transmit this byte count as an ASCII decimal-encoded number, followed by a newline.
- As mentioned above, be explicit about what "newline" means.
Transmit the encoded JSON byte stream (from the memory stream), exactly as it is.
- The number of bytes transmitted must match the ASCII decimal-encoded number.
Transmit the binary blob as the rest of the stream.
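
The streamable layout above could look roughly like this in Java. It is a sketch under stated assumptions rather than a finished format: `FramedBlob` and its methods are hypothetical names, "newline" is fixed to a single `\n` byte, the JSON is encoded as UTF-8, and `InputStream.readNBytes(int)` requires Java 11 or later.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical framing: "<header byte count>\n" + JSON header bytes + raw blob. */
final class FramedBlob {

    /** Writes the length line, the encoded JSON header, then the blob unchanged. */
    static void write(OutputStream out, String json, byte[] blob) throws IOException {
        // Encode first: the count must be in bytes, not in characters.
        byte[] header = json.getBytes(StandardCharsets.UTF_8);
        out.write(Integer.toString(header.length).getBytes(StandardCharsets.US_ASCII));
        out.write('\n'); // assumption: newline means exactly one LF byte
        out.write(header);
        out.write(blob);
        out.flush();
    }

    /**
     * Reads the length line and the JSON header, leaving the stream
     * positioned at the first byte of the blob.
     */
    static String readHeader(InputStream in) throws IOException {
        int length = 0;
        boolean sawDigit = false;
        int b;
        // Read byte by byte so nothing past the header is consumed by accident.
        while ((b = in.read()) != '\n') {
            if (b < '0' || b > '9') { // also rejects end of stream (-1)
                throw new IOException("malformed length prefix");
            }
            length = length * 10 + (b - '0');
            sawDigit = true;
        }
        if (!sawDigit) {
            throw new IOException("empty length prefix");
        }
        byte[] header = in.readNBytes(length);
        if (header.length != length) {
            throw new EOFException("truncated header");
        }
        return new String(header, StandardCharsets.UTF_8);
    }
}
```

On the producing side this would be called with `System.out` as the output stream. On the consuming side, after `readHeader(System.in)` returns, the remaining bytes of the stream are the blob and can be copied onward, for example with `in.transferTo(...)`.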

As a slight modification, you can probably transmit the byte stream lengths of both the JSON stream and binary blob.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange