I need to dig into a logs table with a schema similar to this:

CREATE TABLE t (
  id int PRIMARY KEY,
  data varchar(max)
);

The data column stores XML text received from a web service, in this format:

This is a reduced version

<?xml version="1.0" encoding="UTF-8"?>
<PARAM>
  <TAB DIM="30" ID="ZC3D2_1" SIZE="5">
    <LIN NUM = "1">
      <FLD NAME = "ZDOC" TYPE = "Char">Ferran López</FLD>
    </LIN>
  </TAB>
</PARAM>

When I try to cast this text to XML, I get the following error:

XML parsing: line xx, character 48, illegal xml character

It can be solved by removing the <xml> tag or, at least, the encoding attribute.

NOTE: It works fine if there are no special characters like ó, even if I don't remove the <xml> tag.

Question

Is there a way to convert it to XML without replacing or removing the <xml> tag, i.e. without resorting to something like the following?

CAST(REPLACE(data, 'encoding="UTF-8"', '') as XML)

db<>fiddle here

UPDATE

The server collation is: Latin1_General_BIN

But even if I change the collation to my usual server collation, it doesn't work.

SELECT
  id, 
  CAST((data COLLATE Latin1_General_CI_AS) as XML)
FROM
  t;

Solution

What's happening here is:

  1. The XML type stores data internally as UTF-16 Little Endian (most of the time, at least). It doesn't matter what the source encoding is, the end-result will be UTF-16 LE (and no <xml> tag, hence no encoding="...").
  2. When converting a string to XML:
    1. It's the bytes of the string that get converted, not the characters (will explain the difference in a moment)
    2. NVARCHAR data is assumed to be UTF-16 LE. If there is an <xml> tag and it contains the encoding attribute, the only valid value is "UTF-16".
    3. VARCHAR data is assumed to be in the 8-bit code page associated with the collation of the data when there is no <xml> tag, or when one exists but has no encoding attribute. Otherwise, the bytes are interpreted using the encoding named in the encoding attribute, even though they are actually encoded in the code page associated with the collation of the data.
  3. Your data is most likely encoded as Windows code page 1252 (this is determined by the collation of the column that the data resides in, not the collation of the instance or even the database, but since you mention that the instance is using Latin1_General_BIN, it's safe-enough to assume for the moment that the column is using the same collation).
  4. The code point for the ó character in code page Windows-1252 is: 0xF3.
  5. The <xml> tag, however, is declaring that the XML data is encoded as UTF-8.
  6. In UTF-8, 0xF3 must be followed by three bytes, each being between 0x80 and 0xBF, yet in your data it's followed by a p, which has a value of 0x70. Hence you get the "illegal xml character" error (because the encoding="UTF-8" tells the conversion function that the bytes are valid UTF-8 bytes; the conversion doesn't see the ó character).
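
The diagnosis above can be checked directly. What follows is a minimal sketch, not a definitive test: it assumes the table lives in the dbo schema (the question does not say) and that the column really does use a code page 1252 collation such as Latin1_General_BIN.

```sql
-- Sketch only: dbo.t and its [data] column are taken from the question;
-- the dbo schema is an assumption.

-- 1. Which collation, and therefore which code page, does the column use?
SELECT c.[name],
       c.collation_name,
       COLLATIONPROPERTY(c.collation_name, 'CodePage') AS [CodePage] -- 1252 for Latin1_General_BIN
FROM   sys.columns c
WHERE  c.[object_id] = OBJECT_ID(N'dbo.t')
AND    c.[name] = N'data';

-- 2. How is ó encoded as VARCHAR (code page 1252) versus NVARCHAR (UTF-16 LE)?
SELECT CONVERT(VARBINARY(4),
               CONVERT(VARCHAR(4), N'ó' COLLATE Latin1_General_BIN)) AS [Cp1252Bytes], -- 0xF3
       CONVERT(VARBINARY(4), N'ó')                                   AS [Utf16Bytes];  -- 0xF300
```

A single 0xF3 byte is exactly what a UTF-8 parser rejects when the next byte is 0x70 (the p), which matches the error described in point 6.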

Your options are:

  1. Ideally, the column would be converted to XML, and the encoding attribute of the <xml> tag, or the entire <xml> tag itself, would be removed on the way in. In addition, the XML datatype can save space if there are repeating element and/or attribute names, as it creates a dictionary (lookup list) of names internally and records the structure using the ID values.

  2. Set the [data] column to use a UTF-8 collation (new in SQL Server 2019, so not an option for you)

  3. Set the [data] column to be NVARCHAR, and remove the encoding attribute of the <xml> tag, or the entire <xml> tag.

  4. Convert the incoming string into UTF-8 bytes. The ó character is two bytes in UTF-8, 0xC3B3, which appear as Ã³ in Windows-1252.

    DECLARE @Good VARCHAR(MAX) = '<?xml version="1.0" encoding="UTF-8"?><a>hell'
            + CONVERT(VARCHAR(MAX), 0xC3B3)
            + '</a>';
    SELECT @Good, CONVERT(XML, @Good);
    -- <?xml version="1.0" encoding="UTF-8"?><a>hellÃ³</a>
    --
    -- <a>helló</a>
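
For option 3, the point made earlier (NVARCHAR data only accepts "UTF-16" as the declared encoding) suggests a query-time variant that needs no schema change. This is a rough sketch under stated assumptions: it uses table t from the question and assumes the declaration contains exactly the text encoding="UTF-8".

```sql
-- Sketch only: convert to NVARCHAR first, then make the declaration match.
SELECT id,
       CONVERT(XML,
               REPLACE(CONVERT(NVARCHAR(MAX), data),  -- code page 1252 -> UTF-16 LE
                       N'encoding="UTF-8"',
                       N'encoding="UTF-16"')) AS [DataAsXml]
FROM   t;
```

This only reads back what the VARCHAR column was able to store. Any character outside code page 1252 was already replaced on the way in, so it does not remove the need to fix the column type.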
    

NOTES:

  • Simply removing the encoding attribute of the <xml> tag, or the entire <xml> tag, is not an option. Sure, it will work in this particular case, but it won't work in all cases due to the column being VARCHAR and UTF-8 collations not being available in SQL Server 2014. Hence, any Unicode characters not available in Windows code page 1252 will be converted to ? or ?? (depending on BMP character or Supplementary Character):
    DECLARE @Test VARCHAR(MAX) = '<test>ó - ☢ - 🌝</test>';
    SELECT @Test, CONVERT(XML, @Test);
    -- <test>ó - ? - ??</test>
    --
    -- <test>ó - ? - ??</test>
    
  • Do NOT simply change the collation of the column to a different locale / culture. While that might get rid of the error, it would only accomplish that by silently getting rid of the data that was causing the error. For example:
    DECLARE @Data NVARCHAR(MAX) = N'ó';
    SELECT CONVERT(VARCHAR(MAX), @Data COLLATE Latin1_General_BIN) AS [Latin1_General],
        CONVERT(VARCHAR(MAX), @Data COLLATE Latin1_General_BIN) COLLATE
                 Cyrillic_General_CI_AS AS [Cyrillic];
    /*
    Latin1_General    Cyrillic
    ó                 o
    */
    
    "Cyrillic" uses a different code page than "Latin1_General", and the ó character is not available on the Cyrillic code page. But, there is a "Best Fit" mapping which is why we end up with an o instead of a ?.
  • You, and anyone working on SQL Server 2008 or newer, really should be using the _100_ level collations. Additionally, anyone working on SQL Server 2012 or newer should ideally be using the _100_ level collation that ends with _SC (for Supplementary Characters). Finally, when needing a binary collation on SQL Server 2005 or newer, use one ending in _BIN2 (see my post here as to why).
  • This issue has nothing to do with whether the query is ad hoc or in a stored procedure (T-SQL or SQLCLR).
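
To see which code page a given collation maps to before switching, COLLATIONPROPERTY can be queried. A small sketch follows; the collation names are just the ones mentioned above plus one _100_ / _BIN2 example chosen for illustration.

```sql
-- Code page behind each collation (only relevant to VARCHAR data).
SELECT COLLATIONPROPERTY('Latin1_General_BIN',      'CodePage') AS [Latin1_General_BIN],      -- 1252
       COLLATIONPROPERTY('Cyrillic_General_CI_AS',  'CodePage') AS [Cyrillic_General_CI_AS],  -- 1251
       COLLATIONPROPERTY('Latin1_General_100_BIN2', 'CodePage') AS [Latin1_General_100_BIN2]; -- 1252

-- List the _100_ level Latin1_General collations available on this instance.
SELECT [name], [description]
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE N'Latin1[_]General[_]100[_]%';
```

Moving between two collations that share code page 1252 does not alter the stored VARCHAR bytes, whereas moving to a collation with a different code page (such as Cyrillic) does.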

Other tips

For encoding="UTF-8" to be valid, your XML stored in a varchar(max) column should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<PARAM>
  <TAB DIM="30" ID="ZC3D2_1" SIZE="5">
    <LIN NUM = "1">
      <FLD NAME = "ZDOC" TYPE = "Char">Ferran LÃ³pez</FLD>
    </LIN>
  </TAB>
</PARAM>

The ó should be represented by the two-byte value Ã³ (the UTF-8 bytes 0xC3B3 as they appear in Windows-1252).

If you don't have a UTF-8 encoded string stored in your column, the right way to go about this is to remove the encoding from the XML before you convert the value to the XML datatype.

I think you have a deeper problem. UTF-8 allows for more characters than the regular non-Unicode collations in SQL Server. So, to be safe, you should either use SQL Server 2019, which has UTF-8 collations (and I understand if that isn't doable or desirable, for many reasons), or try nvarchar instead of varchar.

If you are worried about the storage increase when going from varchar to nvarchar, you can possibly use row compression. But that requires Enterprise Edition prior to SQL Server 2016 SP1.
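
A hedged sketch of how row compression could be evaluated and then applied. It assumes the table is dbo.t (the schema is not given in the question). The documentation describes Unicode compression as covering nchar(n) and nvarchar(n) but not nvarchar(max), so the savings on a (max) column may be small; estimating first seems prudent.

```sql
-- Sketch only: dbo.t is an assumed name.

-- Estimate the effect first
-- (arguments: schema, object, index_id, partition_number, compression type).
EXEC sys.sp_estimate_data_compression_savings N'dbo', N't', NULL, NULL, N'ROW';

-- Apply row compression if the estimate looks worthwhile.
ALTER TABLE dbo.t REBUILD WITH (DATA_COMPRESSION = ROW);
```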

Use a UTF-8 collation for your varchar column (available in SQL Server 2019 and later):

CREATE TABLE t (
  id int PRIMARY KEY,
  data varchar(max) COLLATE Latin1_General_100_CI_AI_SC_UTF8
);

INSERT INTO t VALUES
(1, N'<?xml version="1.0" encoding="UTF-8"?>
<PARAM>
  <TAB DIM="30" ID="ZC3D2_1" SIZE="5">
    <LIN NUM = "1">
      <FLD NAME = "ZDOC" TYPE = "Char">Ferran López</FLD>
    </LIN>
  </TAB>
</PARAM>
')
GO
SELECT
  id, 
  CAST(data as XML)
FROM
  t;
GO
id | (No column name)                                                                                                            
-: | :---------------------------------------------------------------------------------------------------------------------------
 1 | <PARAM><TAB DIM="30" ID="ZC3D2_1" SIZE="5"><LIN NUM="1"><FLD NAME="ZDOC" TYPE="Char">Ferran López</FLD></LIN></TAB></PARAM>

db<>fiddle here
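
Why this works can be seen from the bytes. The following is a minimal sketch that assumes SQL Server 2019 or later, since the _UTF8 collations do not exist in earlier versions.

```sql
-- With a UTF-8 collation, VARCHAR stores ó as the two UTF-8 bytes
-- that encoding="UTF-8" promises.
SELECT CONVERT(VARBINARY(4),
               CONVERT(VARCHAR(4),
                       N'ó' COLLATE Latin1_General_100_CI_AI_SC_UTF8)) AS [Utf8Bytes]; -- 0xC3B3
```

Because the stored bytes now agree with the declared encoding, the cast to XML no longer hits the illegal-character error.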

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange