Question

I have a program that has to decode a Base64 string and then decode the result as UTF-8. The program pulls the text from a .doc file that it downloads locally from Dropbox (using Temboo). There are still weird characters before and after the document's text. This is what a section of the page looks like in Microsoft Word 2011 for Mac:

[image: the garbled characters as shown in Word]

I tried to put the text into text decoders online and couldn't seem to find what encoding the chunk of text above was. This is how I am currently decoding the text:

// strip line breaks before decoding
encoded = encoded.replaceAll("[\r\n]", "");

// decode the response (Apache Commons Codec)
decoded = StringUtils.newStringUtf8(Base64.decodeBase64(encoded));

In TextEdit.app it looks like this:

[image: the same text opened in TextEdit.app]

Does anyone know what encoding this is and how I can decode these characters?


Solution

Here is the first part of a Word .docx file, in hex:

50 4b 03 04 14 00 06 00 08 00 00 00 21 00 e1 0f
8e bf 8d 01 00 00 29 06 00 00 13 00 08 02 5b 43
6f 6e 74 65 6e 74 5f 54 79 70 65 73 5d 2e 78 6d
6c 20 a2 04 02 28 a0 00 02 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Note that each 2-digit value above is one character. The first two values -- 50 and 4b -- are the ASCII characters P and K. (Google "ASCII table" and you will see what I mean.)

Here is all the character data you can see:

PK[Content_Types].xml

If you look at the hex values, anything above 0x7F is not a valid ASCII character, and a stray byte above 0x7F is not necessarily valid UTF-8 either. When such data is transmitted over the internet via certain protocols, it is apt to get garbled (since those protocols expect ASCII characters) unless it is first encoded into ASCII. That is the purpose of "Base-64".
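To see this concretely, here is a short sketch using the first bytes of the hex dump above (including the sequence e1 0f, which is not valid UTF-8). Java's UTF-8 decoder turns any bytes it cannot interpret into the replacement character U+FFFD:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // First bytes of the .docx dump: "PK", two control bytes, then e1 0f (invalid UTF-8)
        byte[] binary = {0x50, 0x4b, 0x03, 0x04, (byte) 0xe1, 0x0f, (byte) 0x8e, (byte) 0xbf};
        String asText = new String(binary, StandardCharsets.UTF_8);
        System.out.println(asText.startsWith("PK"));       // true
        System.out.println(asText.indexOf('\uFFFD') >= 0); // true: invalid bytes became U+FFFD
    }
}
```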

Base-64 encodes the above data as:

UEsDBBQABgAIAAAAIQDhD46/jQEAACkGAAATAAgCW0NvbnRlbn
RfVHlwZXNdLnhtbCCiBAIooAACAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This can be safely transmitted, since all values are regular ASCII characters (their numeric values are below 0x7F).
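You can reproduce the start of that encoding with the JDK's own Base64 encoder (a sketch; the question's code uses Apache Commons Codec, but the output is identical):

```java
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        // The first six bytes of the .docx header shown above
        byte[] header = {0x50, 0x4b, 0x03, 0x04, 0x14, 0x00};
        System.out.println(Base64.getEncoder().encodeToString(header)); // UEsDBBQA
    }
}
```

Note that "UEsDBBQA" matches the first eight characters of the Base-64 text above.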

When you decode the Base-64 you presumably get back the same data that you started with, so if you write that data to a file you will have "reconstituted" the original .docx file.

If, on the other hand, you feed the decoded data (or data that was never encoded) into a byte-to-string converter (such as newStringUtf8), then bytes larger than 0x7F are interpreted as UTF-8 sequences and translated into the corresponding UTF-16 or UTF-32 characters. But "binary" data (such as the header data in a .doc or .docx file) is just numbers -- it's not character data. Converting those binary values to UTF characters produces nothing meaningful. Worse, byte sequences that are not valid UTF-8 are replaced with the replacement character (U+FFFD) during the conversion, so the original bytes are lost and cannot be recovered by converting back.
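A sketch of that loss, again using the header bytes from the hex dump above: converting binary bytes to a String and back does not return the original bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    public static void main(String[] args) {
        // Header bytes including sequences that are not valid UTF-8
        byte[] original = {0x50, 0x4b, 0x03, 0x04, (byte) 0xe1, 0x0f, (byte) 0x8e, (byte) 0xbf};
        String asText = new String(original, StandardCharsets.UTF_8); // lossy!
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, back)); // false: the bytes did not survive
    }
}
```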

The way to deal with this file is to "reconstitute" the .doc file from Base-64 form to "binary", write that data to a "binary" file, and then use software that understands how to read its header and take it apart sensibly. This would be either Word itself or some API written specifically to access the innards of Word files.
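In Java, that reconstruction is just "decode to bytes, write the bytes" -- the data never goes through a String. A minimal sketch (the output filename document.docx is only an example, and the sample string is truncated; the real input would be the full Base-64 text):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class Reconstitute {
    public static void main(String[] args) throws Exception {
        // Truncated sample from above; the real input is the complete Base64 string
        String encoded = "UEsDBBQA";
        // The MIME decoder skips \r\n, so no replaceAll() cleanup is needed first
        byte[] fileBytes = Base64.getMimeDecoder().decode(encoded);
        Files.write(Paths.get("document.docx"), fileBytes); // hypothetical output path
    }
}
```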

OTHER TIPS

There is no Base64 in your samples. I would recommend you use an Office format library (such as Apache POI) to extract text and data from Office documents.
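As an aside, before reaching for POI: the PK signature means a .docx is an ordinary ZIP archive, so its parts can at least be listed with the JDK's java.util.zip alone. A runnable sketch (it builds a tiny stand-in archive in memory, since no real file is assumed here; on a reconstituted .docx you would open a FileInputStream instead):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ListDocxParts {
    public static void main(String[] args) throws Exception {
        // Build a tiny in-memory ZIP with the entry names a .docx would contain
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf)) {
            out.putNextEntry(new ZipEntry("[Content_Types].xml"));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("word/document.xml"));
            out.closeEntry();
        }

        // Listing the parts works the same way on a real reconstituted .docx
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                System.out.println(entry.getName());
            }
        }
    }
}
```

Note that the in-memory archive itself starts with the bytes 0x50 0x4B ("PK"), exactly as in the hex dump above.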

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow