Efficient ByteArrayInputStream manipulation

https://stackoverflow.com/questions/9726821

24-05-2021

Question

I am working with a ByteArrayInputStream that contains an XML document consisting of a single element whose content is a large base64-encoded string. I need to remove the surrounding tags so I can decode the text and output it as a PDF document.

What is the most efficient way to do this?

My knee-jerk reaction is to read the stream into a byte array, find the end of the start tag, find the beginning of the end tag, and then copy the middle part into another byte array. This seems rather inefficient, though, and the text I am working with can be large at times (128 KB). I would like a way to do this without the extra byte arrays.

Solution

Do your search and conversion while you are reading the stream.

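// find the start tag
byte[] startTag = new byte[]{'<', 't', 'a', 'g', '>'};
int fnd = 0; // number of start-tag bytes matched so far
int tmp = 0;
while((tmp = stream.read()) != -1) {
 if(tmp == startTag[fnd])
  fnd++;
 else
  fnd = (tmp == startTag[0]) ? 1 : 0; // a mismatching '<' may itself begin the tag
 if(fnd == startTag.length) break;
}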

// get base64 bytes
while(true) {
 int a = stream.read();
 int b = stream.read();
 int c = stream.read();
 int d = stream.read();
 byte o1,o2,o3; // output bytes
 if(a == -1 || a == '<') break;
 // decode the base64 characters a, b, c, d into o1, o2, o3
 ...
 outputStream.write(o1);
 outputStream.write(o2);
 outputStream.write(o3);
}

Note: the above was written in my web browser, so syntax errors may exist.
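As a runnable variant of the same idea, the sketch below skips past the start tag and then hands the elided decode step to java.util.Base64 instead of decoding four characters at a time by hand. The class and method names (StreamingTagDecoder, UntilTagStream, skipPast, decode, decodeToBytes) are made up for illustration, the "<tag>" element name is taken from the snippet above, and the choice of the MIME decoder (which ignores line breaks inside the content) is an assumption about the input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class StreamingTagDecoder {

    /** Presents the wrapped stream as ending at the first '<' (the end tag). */
    private static final class UntilTagStream extends FilterInputStream {
        private boolean done;

        UntilTagStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            if (done) return -1;
            int b = in.read();
            if (b == -1 || b == '<') {
                done = true;
                return -1;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = 0;
            while (n < len) {
                int b = read();
                if (b == -1) break;
                buf[off + n++] = (byte) b;
            }
            return (n == 0 && len > 0) ? -1 : n;
        }
    }

    /** Consumes bytes up to and including the first occurrence of tag (must be non-empty). */
    static boolean skipPast(InputStream in, byte[] tag) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == tag[matched]) {
                matched++;
            } else {
                // Simple restart: sufficient when tag[0] ('<') occurs only once in the tag.
                matched = (b == tag[0]) ? 1 : 0;
            }
            if (matched == tag.length) return true;
        }
        return false;
    }

    /** Streams the decoded element content to out without buffering the whole document. */
    static void decode(InputStream xml, String startTag, OutputStream out) throws IOException {
        if (!skipPast(xml, startTag.getBytes(StandardCharsets.US_ASCII))) {
            throw new IOException("start tag not found: " + startTag);
        }
        // Assumption: the MIME decoder is used so that line breaks in the content are ignored.
        InputStream decoded = Base64.getMimeDecoder().wrap(new UntilTagStream(xml));
        byte[] buf = new byte[8192];
        int n;
        while ((n = decoded.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    /** Convenience wrapper for in-memory input; an IOException is not expected here. */
    static byte[] decodeToBytes(byte[] xml, String startTag) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            decode(new ByteArrayInputStream(xml), startTag, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In the real use case the OutputStream passed to decode would be the PDF destination (a file or response stream), so only the small 8 KB buffer is allocated on top of the original input.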

Other tips

Base64 does not use the characters < or >, and I'm assuming you are using a standard or web-safe base64 variant, so you do not need to worry about HTML entities or comments inside the content. If you are really sure that the content has this form, then do the following:

Scan from the right looking for a '<'. This will be the beginning of the close tag.
Scan left from that position looking for a '>'. This will be the end of the start tag.

The base 64 content is between those two positions, exclusive.
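The two scans can be sketched as follows, assuming the whole document is already in a byte array. The class name TagBounds and the method contentBounds are hypothetical, and the code relies on the assumption stated above that the content itself contains no '<' or '>'.

```java
final class TagBounds {

    /**
     * Returns {start, end}: the base64 content occupies xml[start] .. xml[end - 1].
     * Assumes a single element whose content contains no '<' or '>'.
     */
    static int[] contentBounds(byte[] xml) {
        // Scan from the right for '<', the beginning of the close tag.
        int end = xml.length - 1;
        while (end >= 0 && xml[end] != '<') {
            end--;
        }
        if (end < 0) {
            throw new IllegalArgumentException("no '<' found");
        }
        // Scan left from there for '>', the end of the start tag.
        int start = end - 1;
        while (start >= 0 && xml[start] != '>') {
            start--;
        }
        if (start < 0) {
            throw new IllegalArgumentException("no '>' found before the close tag");
        }
        return new int[] { start + 1, end };
    }
}
```

Because only the two short tags are scanned, the cost of locating the content does not grow with the size of the base64 payload.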

You can presize your second array by using

((end - start + 3) / 4) * 3

as an upper bound on the decoded content length, and then base64-decode into it. This works because every 4 base64 digits encode 3 bytes.
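A small sketch of that bound, together with one way to decode just the content range without first copying it out: the names RangeDecoder, maxDecodedLength and decodeRange are hypothetical, and start and end are assumed to delimit the content as described above. It uses the standard-alphabet decoder; for a web-safe variant, Base64.getUrlDecoder() would be the substitute.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

final class RangeDecoder {

    /** Upper bound on the decoded size of xml[start, end): 3 bytes per 4 base64 digits. */
    static int maxDecodedLength(int start, int end) {
        return ((end - start + 3) / 4) * 3;
    }

    /**
     * Decodes xml[start, end) directly from the original array.
     * ByteBuffer.wrap only creates a view, so the encoded text is not copied;
     * the decoder allocates the output buffer itself.
     */
    static ByteBuffer decodeRange(byte[] xml, int start, int end) {
        ByteBuffer encoded = ByteBuffer.wrap(xml, start, end - start);
        return Base64.getDecoder().decode(encoded);
    }
}
```

For 16 base64 digits the bound is 12 bytes, while the actual decoded length may be 10 or 11 when the last group is padded with '='; the number of valid bytes in the returned buffer is given by its remaining() value.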

If you want to get really fancy, since you know the first few bytes of the array contain ignorable tag data and the encoded data is smaller than the input, you could destructively decode the data over your current byte buffer.
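The destructive variant might look like the sketch below. It is a hand-written decoder rather than a JDK API, since java.util.Base64 has no documented in-place mode; the names InPlaceBase64 and decodeInPlace are hypothetical. Overwriting is safe because every 4 input digits produce only 3 output bytes and the write position starts at index 0, so it never overtakes the read position.

```java
final class InPlaceBase64 {

    /** Maps a base64 digit (standard or web-safe alphabet) to its 6-bit value, or -1. */
    private static int value(int c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+' || c == '-') return 62;
        if (c == '/' || c == '_') return 63;
        return -1;
    }

    /**
     * Decodes the base64 text in buf[start, end) over the front of buf.
     * Returns the number of decoded bytes, which then occupy buf[0 .. n - 1].
     */
    static int decodeInPlace(byte[] buf, int start, int end) {
        int w = 0;      // write position, always behind the read position
        int bits = 0;   // accumulates up to four 6-bit values
        int count = 0;  // digits collected in the current group
        for (int r = start; r < end; r++) {
            int c = buf[r];
            if (c == '=') break;       // padding marks the end of the data
            int v = value(c);
            if (v < 0) continue;       // skip bytes outside the alphabet, e.g. line breaks
            bits = (bits << 6) | v;
            if (++count == 4) {
                buf[w++] = (byte) (bits >> 16);
                buf[w++] = (byte) (bits >> 8);
                buf[w++] = (byte) bits;
                bits = 0;
                count = 0;
            }
        }
        if (count == 3) {              // 18 bits collected: two full bytes
            buf[w++] = (byte) (bits >> 10);
            buf[w++] = (byte) (bits >> 2);
        } else if (count == 2) {       // 12 bits collected: one full byte
            buf[w++] = (byte) (bits >> 4);
        }
        return w;
    }
}
```

After the call, the decoded PDF bytes can be written out with something like outputStream.write(buf, 0, n); the rest of the buffer holds leftover encoded text and should be ignored.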

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow