Question

We have written an application that opens Microsoft Word documents, reads all of the text inside, and then sends that data to an external system for processing. This worked fine in the past, but since we started accepting Unicode, we've been having issues reading the Word documents.

The issue we are seeing is that we are unable to display any character that takes up more than one code unit, such as 𠆾 (a surrogate pair) or ā̈ (a grapheme cluster). When we try to display 𠆾, we get two ??, and with ā̈, we get each individual character that makes up the grapheme.
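To make the two terms concrete, here is a minimal sketch of how a .NET string counts these characters. The class and method names (TextInspection, CodeUnitCount, TextElementCount) are made up for illustration, and the escape sequence "a\u0304\u0308" is only one assumed way the ā̈ grapheme might be composed.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the original application.
public static class TextInspection
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CodeUnitCount(string s) => s.Length;

    // Number of text elements (user-perceived characters): a surrogate pair
    // or a base character plus its combining marks counts as one element.
    public static int TextElementCount(string s) =>
        new StringInfo(s).LengthInTextElements;

    // True if the string starts with a high/low surrogate pair.
    public static bool StartsWithSurrogatePair(string s) =>
        s.Length >= 2 && char.IsSurrogatePair(s, 0);
}
```

With these helpers, TextInspection.CodeUnitCount("𠆾") is 2 while TextInspection.TextElementCount("𠆾") is 1, and "a\u0304\u0308" has three code units but still one text element. Anything that treats each char as a whole character will split both of them.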

I have a feeling that these characters come back like this because we are not reading the file properly. However, I have been searching and haven't found a solution yet.

I have created a Word Document that contains only a single value: 𠆾.

The first thing we do in the code is read the file into a byte array:

// Read the whole file into a byte array; the using block closes the stream.
using (FileStream fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
    wordDocument = new byte[fileStream.Length];
    fileStream.Read(wordDocument, 0, (int)fileStream.Length);
}

Upon further investigation, the byte array contains the following values:

{63, 63, 13, 10}, or in hex {0x3f, 0x3f, 0x0d, 0x0a}

From looking up the hex values, I have learned that 0x3f corresponds to a ?, which explains why we are getting back two ??.

Then, when we try to convert the data back to a string, we end up getting back the two ??

textdata = System.Text.Encoding.Unicode.GetString(wordDocument);
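As a sanity check on the decoding step, here is a small sketch showing that Encoding.Unicode (UTF-16LE) can carry a surrogate pair through a byte array and back, as long as the bytes really are UTF-16. The class and method names (EncodingCheck, RoundTripsAsUtf16, ToHex) are invented for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original application.
public static class EncodingCheck
{
    // Encode to UTF-16LE bytes and decode again with the same encoding;
    // returns true when the text survives unchanged.
    public static bool RoundTripsAsUtf16(string text)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(text);
        return Encoding.Unicode.GetString(bytes) == text;
    }

    // Hex dump of a byte array, e.g. "3F-3F-0D-0A", for inspecting file contents.
    public static string ToHex(byte[] bytes) => BitConverter.ToString(bytes);
}
```

EncodingCheck.RoundTripsAsUtf16("𠆾") returns true, and the encoded form is four bytes long, not two. Since FileStream.Read hands back the file's bytes unchanged, 0x3f bytes in the array mean the ? characters are already stored in the file itself, so no choice of encoding in GetString can bring the original character back.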

I figure the issue may be with how we are reading in the document, but I am not 100% sure. Can anyone guide me on the correct path?

Solution

You can use the Microsoft Office Primary Interop Assemblies to access the object model of a Word document. Try adding an assembly reference in Visual Studio (something like Office 12 or Microsoft Word 12). Check out this link. It covers some of the basics.
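A rough sketch of what that could look like, assuming Word is installed on the machine, the project references Microsoft.Office.Interop.Word, and the compiler is C# 4 or later (so the optional and ref arguments of the COM calls can be omitted). The class and method names (WordTextReader, ReadAllText) are made up for illustration, and fileName is assumed to be a full path.

```csharp
using System;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of the original application.
public static class WordTextReader
{
    // Opens the document in Word and returns its text as a .NET string.
    public static string ReadAllText(string fileName)
    {
        Word.Application app = new Word.Application();
        Word.Document doc = null;
        try
        {
            // Let Word parse the file instead of decoding raw bytes ourselves.
            doc = app.Documents.Open(fileName, ReadOnly: true);
            return doc.Content.Text;
        }
        finally
        {
            // Close the document without saving and shut down the Word process.
            if (doc != null) doc.Close(SaveChanges: false);
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
    }
}
```

Going through the object model means Word handles the file format, and the text arrives as a regular UTF-16 string. The trade-off is that this starts a Word instance for each call, which may be slow for large batches.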

Licensed under: CC-BY-SA with attribution