Collate issues with wrong characters

https://dba.stackexchange.com/questions/197263

11-10-2020
|

题

Well, the problem is well known, but I'm looking for a smarter solution if there's one.

For some reason the system doesn't recognize some characters and I can't compare the columns

Here is an example of the text:

Right

ASPIRADOR ULTRASSONICO-LOCAÇAO (NOTA FISCAL SERVIÇO)

Wrong

ASPIRADOR ULTRASSONICO-LOCA€AO (NOTA FISCAL SERVI€O)

Actually I'm fixing this through this function

create function fixcollation(@ps_Texto VARCHAR(4000)) returns VARCHAR(4000) 

as 

begin  

    declare @vlgsv1itu INT declare @nxn68ezzi INT declare @dw17rsyva  VARCHAR(50) declare @iw8a2z01i VARCHAR(50) declare @t64e98xq6 VARCHAR(50) declare @zwjs2imy3 INT declare @jsyt85sy8 VARCHAR(4000)  

    ---------------------------------------------------- 

    set @dw17rsyva = ' …Æƒ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãåâ™£—–éëêš‡€§' 
    set @iw8a2z01i = 'áàãâäÁÀÃÂÄéèêëÈÉÊËíìïÍÌÏóòõôöÓÒÕÔÖúùûüÚÙÛÜçÇºØ' 
    set @jsyt85sy8 = @ps_Texto set @zwjs2imy3 = IsNull(datalength(@ps_Texto), 0) 
    set @nxn68ezzi = 1 
    while(@nxn68ezzi <= IsNull(datalength( @ps_Texto), 0)) 

    begin 

        set @vlgsv1itu = 1 

        while(@vlgsv1itu <= IsNull(datalength(@dw17rsyva), 0)) 
        begin 

            IF(ASCII(SUBSTRING(@ps_Texto, @nxn68ezzi, 1) COLLATE LATIN1_GENERAL_CS_AS) = ASCII(SUBSTRING(@dw17rsyva, @vlgsv1itu, 1) COLLATE LATIN1_GENERAL_CS_AS)) 
            BEGIN 
                set @t64e98xq6 = SUBSTRING( @iw8a2z01i, @vlgsv1itu, 1) set @jsyt85sy8 = SUBSTRING(@jsyt85sy8, 1, @nxn68ezzi -1) + @t64e98xq6 + SUBSTRING(@jsyt85sy8, @nxn68ezzi + 1, @zwjs2imy3 -  @nxn68ezzi) 
                break 
            end 
            set @vlgsv1itu = @vlgsv1itu + 1 
        end 
        set @nxn68ezzi = @nxn68ezzi + 1 
    end 
    return @jsyt85sy8 
end

So, my question is: is this the best way or have I missed something here?

EDIT

Just a complementary test

select dbo.fixcollation(' …Æƒ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãåâ™£—–éëêš‡€§')
select dbo.FixCodePage850toCodePage1252(' …Æƒ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãåâ™£—–éëêš‡€§')

And this is the result in my production environment

fixcollation

FixCodePage850toCodePage1252

My personal thanks to Solomon Rutzky

解决方案

This is an incorrect encoding issue. The characters are coming in encoded as DOS Code Page 850 yet the target Code Page you are using (based on the Latin1_General Collations) is Windows Code Page 1252. For example, in DOS Code Page 850, the Ç character has a value of 0x80 (or 128 in Decimal). However, that same value of 0x80 in Windows Code Page 1252 gives you €. Likewise, Ã in DOS Code Page 850 has a value of 0xC7 (or 199 in Decimal). However, that same value of 0xC7 in Windows Code Page 1252 gives you Ç.

The incorrect characters are incorrect due to being imported into SQL Server with the wrong encoding being specified for the source. This is not happening within SQL Server as that would be a code page conversion issue, in which case the same "character" would have its value translated for the same character in the target Code Page (if the character exists in the target Code Page, else you get ?). For example:

SELECT ASCII('Ç' COLLATE Latin1_General_CI_AS) AS [CP1252 Value],
       'Ç' COLLATE SQL_Latin1_General_CP850_CI_AS AS [CharacterInCP850],
       ASCII('Ç' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [CP850 Value];

Returns:

CP1252 Value     CharacterInCP850     CP850 Value
199              Ç                    128

Meaning, this is happening most likely during a file import — BCP.exe, SQLCMD.exe, BULK INSERT, OPENROWSET(BULK...), custom app code that reads a file, etc — where either the wrong source Code Page is being specified, or no Code Page at all is being specified for the source. If an import is being done that specifies Code Page 1252 for this file, it will have the effect that you are seeing here since those bytes are encoded for Code Page 850, not Code Page 1252.

It should be noted that this could also happen with data coming in from app code if the driver (ODBC, etc) is being told to use the wrong code page.

Now, regarding the method of fixing this:

Ideally the method of importing the data would be updated / fixed to properly account for the actual code page with which the data has been encoded.
If it is not possible to fix the import process, then the function that the other company provided is not the best way to go. In fact, it is probably the slowest, most convoluted approach which is also prone to errors (if they didn't map all of the characters). There is no reason to do two loops along with SUBSTRING when loading the characters, in pairs, into a table variable would have allowed for a single loop using the REPLACE function. And using the ASCII function and a case-sensitive, accent-sensitive Collation is unnecessary and prone to error (if two characters match what is being searched for) when using a _BIN2 Collation would have been better.

Use the following function which does the conversion. First it gets the bytes of the current string, then it injects those bytes into a VARCHAR column that uses Code Page 850, then it selects that value from the table variable into a local variable (necessary anyway to return the value) which has the effect of converting the string into the Code Page used by the default Collation of the Database (which here would have to be Code Page 1252 else you would not be getting the "correct" string out of the function):

USE [tempdb];
GO
CREATE FUNCTION dbo.FixCodePage850toCodePage1252
(
    @CodePage850String VARCHAR(8000)
)
RETURNS VARCHAR(8000)
WITH SCHEMABINDING
AS
BEGIN
  DECLARE @Convert850to1252 TABLE
  (
    [String] VARCHAR(8000) COLLATE SQL_Latin1_General_CP850_CI_AS
  );
  DECLARE @ReturnValue VARCHAR(8000);

  INSERT INTO @Convert850to1252 ([String])
  VALUES (CONVERT(VARBINARY(8000), @CodePage850String, 0));

  SELECT @ReturnValue = [String] -- automatic conversion to Code Page of database
  FROM @Convert850to1252;

  RETURN @ReturnValue;
END;
GO

Testing both functions returns the same results:

SELECT dbo.fixcollation('Ç‰§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)

SELECT dbo.FixCodePage850toCodePage1252('Ç‰§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)

I came up with a test to check the mappings of all characters just in case the company providing the translation function missed any mappings. I filtered out the graphics characters and dotless "i" that are only found in Code Page 850.

USE [tempdb];
GO
;WITH nums AS
(
    SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [num]
    FROM   master.sys.columns
), vals AS
(
  SELECT nums.[num] AS [Value],
         CHAR(nums.[num]) AS [Character],
         dbo.fixcollation(CHAR(nums.[num])) AS [OldWay],
         dbo.FixCodePage850toCodePage1252(CHAR(nums.[num])) AS [NewWay]
  FROM   nums
)
SELECT vals.*,
       ASCII(vals.[NewWay]) AS [NewValue]
FROM   vals
WHERE  vals.[Character] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND    vals.[OldWay] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND    vals.[Value] NOT IN (176, 177, 178, 179, 180, 185, 186, 187, 188,
                            191, 192, 193, 194, 195, 196, 197, 200, 201,
                            202, 203, 204, 205, 206, 217, 218, 219, 220,
                            223, 254, 213); -- characters only in CP850

That returns a list of 52 characters that could have come through the import process mistranslated like the others, but skipped by the UDF that you were provided by that other company that only handles 46 of the apparently 98 possible characters.

许可以下： CC-BY-SA 和归因

不隶属于 dba.stackexchange