Meanwhile, I have found the solution, so I am adding it as an answer for future reference.
I checked the Symbol font with a glyph viewer and realized that it uses the Private Use Area of Unicode for its characters. Other fonts, like Times New Roman, store the characters in question (e.g. Greek letters) in the standard Unicode ranges.
So the solution is to map the Symbol glyphs to standard Unicode characters. I created a conversion table by hand for the Greek letters (upper and lower case), punctuation marks, numbers and mathematical symbols available in the Symbol font. Note that even the order of the characters within the various ranges differs, e.g. the Greek alphabet is not in the same order in Symbol and in Unicode. So I had to check the character codes one by one.
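To illustrate why the codes have to be checked one by one, here is a small sketch with a handful of entries. It assumes that the Symbol characters sit at U+F000 plus the font's own character code, which is the usual layout for symbol fonts in the Private Use Area. The class name and the selection of entries are hypothetical and only for illustration; this is not the full table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SymbolOrderDemo {

    // A few illustrative entries: Symbol code (Private Use Area) -> Unicode code point.
    // The U+F0xx keys are an assumption about how the Symbol font is laid out.
    static Map<Integer, Integer> sampleTable() {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        table.put(0xF061, 0x03B1); // Symbol 'a' position -> alpha
        table.put(0xF062, 0x03B2); // Symbol 'b' position -> beta
        table.put(0xF063, 0x03C7); // Symbol 'c' position -> chi, not gamma
        table.put(0xF067, 0x03B3); // Symbol 'g' position -> gamma
        return table;
    }

    public static void main(String[] args) {
        // Print each pair so the differing order is visible.
        for (Map.Entry<Integer, Integer> e : sampleTable().entrySet()) {
            System.out.printf("U+%04X -> U+%04X%n", e.getKey(), e.getValue());
        }
    }
}
```

Symbol follows the Latin keyboard letters (a, b, c, ...), while Unicode follows the Greek alphabet (alpha, beta, gamma, ...), so a simple fixed offset between the two ranges does not work.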
Once I had the conversion table, I stored it in a txt file. When my application finds a segment (run) in the Word file that is formatted with the Symbol font (the <w:rFonts> tag in the example), it calls the conversion method. In this method, I parse the txt file into a HashMap and replace the characters one by one, from Symbol code to Unicode:
    public String convert(String symbolString) {
        StringBuilder sb = new StringBuilder();
        // Step by code point rather than by char, so a surrogate pair is never split.
        for (int k = 0; k < symbolString.length(); ) {
            int origCode = symbolString.codePointAt(k);
            Integer replaceCode = conversionTable.get(origCode);
            if (replaceCode != null) {
                sb.appendCodePoint(replaceCode);
            } else {
                sb.append('?'); // no mapping found for this code
            }
            k += Character.charCount(origCode);
        }
        return sb.toString();
    }
Here conversionTable is the HashMap object that maps each Symbol code to its replacement Unicode code; the codes are given as hex values.
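For completeness, here is a minimal sketch of how the lines of such a txt file could be parsed into conversionTable. The file format is an assumption: one pair per line, the Symbol code and the Unicode code as hex values separated by whitespace (e.g. F061 03B1), with # starting a comment line. The class and method names are hypothetical, so adapt them to however your table is stored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConversionTableLoader {

    // Turns lines such as "F061 03B1" (assumed format) into a map of
    // Symbol code point -> Unicode code point.
    public static Map<Integer, Integer> parse(List<String> lines) {
        Map<Integer, Integer> table = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            // Malformed lines are not handled in this sketch: a line with fewer
            // than two fields or a non-hex value will throw an exception.
            String[] parts = line.split("\\s+");
            int symbolCode = Integer.parseInt(parts[0], 16);
            int unicodeCode = Integer.parseInt(parts[1], 16);
            table.put(symbolCode, unicodeCode);
        }
        return table;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the txt file, so the example runs on its own.
        List<String> sample = Arrays.asList(
                "# symbol unicode",
                "F061 03B1",
                "F062 03B2",
                "F063 03C7");
        Map<Integer, Integer> table = parse(sample);
        System.out.println(table.size() + " entries loaded");
    }
}
```

For a real file, the lines can be read with java.nio.file.Files.readAllLines and passed to parse. It is worth building the map once and reusing it, rather than re-parsing the file for every run that is converted.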