Question

I have a very complex problem that is difficult to explain properly. There is LOTS of discussion about this across the internet, but nothing definitive. Any help, or better explanation than mine, is greatly appreciated.

Essentially, I'm just trying to write an XML file using UTF-16 with VBA.

If I do this:

sXML = "<?xml version='1.0' encoding='utf-8'?>"
sXML = sXML & rest_of_xml_document
Print #iFile, sXML

then I get a file that is valid XML. However, if I change the "encoding=" to "utf-16", I get this error from my XML validator:

Switch from current encoding to specified encoding not supported.

Googling tells me that this means the xml encoding attribute is different to the ACTUAL encoding used by the file, hence I must be creating a utf-8 document via Open and Print commands.

If I do something like:

With CreateObject("ADODB.Stream")
  .Type = 2
  .Charset = "utf-16"
  .Open
  .WriteText sXML
  .SaveToFile sFilename, 2
  .Close
End With

then I end up with some funky characters (the BOM) at the beginning of my file which causes it to fail XML validation.

If I open the file in Notepad++, delete the BOM and change the Encoding to "UCS-2", then the file validates fine with a "utf-16" encoding value (meaning that UCS-2 is close enough to UTF-16 that it doesnt matter, or that XML is able to Switch from current encoding between these two types.

I need to use UTF-16 because UTF-8 doesn't cover all the characters used in the presentations I'm exporting.

The question:

How can I get VBA to behave like Notepad++, creating a UTF-16-encoded text file without a BOM that can be filled with XML data? ANY help much appreciated!

Was it helpful?

Solution

Your point about UTF-8 not being able to store all characters you need is invalid.
UTF-8 is able to store every character defined in the Unicode standard.
The only difference is that, for text in certain languages, UTF-8 can take more space to store its codepoints than, say, UTF-16. The opposite is also true: for certain other languages, such as English, using UTF-8 saves space.

VB6 and VBA, although store strings in memory in Unicode, implicitly switch to ANSI (using the current system code page) when doing file IO. The resulting file you get is NOT in UTF-8. It is in your current system codepage, which, as you can discover in this helpful article, looks just like UTF-8 if you're from USA.

Try:

Dim s As String
s = "<?xml version='1.0' encoding='utf-16'?>"
s = s & ChrW$(&H43F&) & ChrW$(&H440&) & ChrW$(&H43E&) & ChrW$(&H432&) & ChrW$(&H435&) & ChrW$(&H440&) & ChrW$(&H43A&) & ChrW$(&H430&)

Dim b() As Byte
b = s

Open "Unicode.txt" For Binary Access Write As #1
Put #1, , b
Close #1

And if you absolutely must have UTF-8, you can make yourself some:

Option Explicit

Private Declare Function WideCharToMultiByte Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, ByRef lpMultiByteStr As Byte, ByVal cchMultiByte As Long, ByVal lpDefaultChar As String, ByRef lpUsedDefaultChar As Long) As Long

Private Const CP_UTF8 As Long = 65001
Private Const ERROR_INSUFFICIENT_BUFFER As Long = 122&


Public Function ToUTF8(s As String) As Byte()

  If Len(s) = 0 Then Exit Function


  Dim ccb As Long
  ccb = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), ByVal 0&, 0, vbNullString, ByVal 0&)

  If ccb = 0 Then
    Err.Raise 5, , "Internal error."
  End If

  Dim b() As Byte
  ReDim b(1 To ccb)

  If WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), b(LBound(b)), ccb, vbNullString, ByVal 0&) = 0 Then
    Err.Raise 5, , "Internal error."
  Else
    ToUTF8 = b
  End If

End Function
Sub Test()
  Dim s As String
  s = "<?xml version='1.0' encoding='utf-8'?>"
  s = s & ChrW$(&H43F&) & ChrW$(&H440&) & ChrW$(&H43E&) & ChrW$(&H432&) & ChrW$(&H435&) & ChrW$(&H440&) & ChrW$(&H43A&) & ChrW$(&H430&)

  Dim b() As Byte
  b = ToUTF8(s)

  Open "utf-8.txt" For Binary Access Write As #1
  Put #1, , b
  Close #1
End Sub
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top