Query to find rows containing ASCII characters in a given range
16-02-2021
Question
I am using some scripts from another topic, but the accepted answer isn't working for all my data scenarios. I would have asked my question on the original How to check for Non-Ascii Characters post, but I don't have enough reputation to comment or up-vote yet.
Questions:
- Why is the accepted answer not working for char(150)?
- Is there a way to fix the accepted answer?
My Testing
I created SQL Fiddle with sample data, the stored procedure from one of the answers, and queries to demonstrate the issue.
Query 1: sample_table
-- Note: The "bad dash" row has char(150)
SELECT * FROM sample_table;
+-------------------+
| DataColumn |
+-------------------+
| test - good dash |
| test – bad dash |
+-------------------+
Query 2: The other answer by John shows the "bad dash" row containing char(150):
SELECT dbo.Find_Invalid_Chars(DataColumn) [Invalid Characters]
FROM sample_table
WHERE dbo.Find_Invalid_Chars(DataColumn) IS NOT NULL;
+----------------------+
| Invalid Characters |
+----------------------+
| test [150] bad dash |
+----------------------+
Query 3: The accepted answer by Martin Smith returns no results:
SELECT DataColumn AS [Bad Data]
FROM sample_table
WHERE DataColumn LIKE '%[' + CHAR(127)+ '-' +CHAR(255)+']%' COLLATE Latin1_General_100_BIN2;
+------------+
| [Bad Data] |
+------------+
-- No rows returned.
Conclusion
Unfortunately, I often need to find characters within (or outside of) a range in databases I can't create stored procedures in. I'd really like to find a fix for the accepted answer or a simple script that wouldn't require creation of any objects (including temp tables).
Any suggestions? Thanks in advance.
EDIT 1: The solution cannot modify or add any objects or settings in the database. I'm looking for a self-contained query that will select rows with one or more characters in a range between two CHAR() numbers, regardless of the ASCII or Extended ASCII number supplied.
EDIT 2: The DataColumn can be either VARCHAR or NVARCHAR. I have no control over this, so I'm hoping to figure out a self-contained query that works for both. The purpose of the query is to find characters in the source table/column that are not handled correctly by some software applications. The applications are interpreting the source correctly, but sometimes have issues with characters outside "standard" ranges, though the ranges vary by application.
Solution
Why is the accepted answer not working for char(150)?
Actually, it does. The problem is that your test is invalid: your test column, DataColumn, is using NVARCHAR instead of VARCHAR. The character itself works in both datatypes, but the behavior is different due to how it is being used in each case:
- In the Find_Invalid_Chars() function (i.e. the "other" answer), the string is being converted back into VARCHAR, since that is the datatype of the input parameter for that function. In this case it works as expected (although I believe it can be done much more efficiently than that loop, but that's for another time ;-).
- In the LIKE query (i.e. the "accepted" answer), the expanded and concatenated result of '%[' + CHAR(127) + '-' + CHAR(255) + ']%' is actually converted into NVARCHAR, since that is the datatype of the column it is being compared to (and NVARCHAR has a higher datatype precedence). Hence that LIKE predicate is not behaving as expected: either the CHAR(255) character maps to a different code point, and/or the CHAR(150) character in the column itself maps to a different code point (the CHAR(127) character does not change as it's in the standard ASCII range). In either case, the conversion to NVARCHAR is causing the numerical value of the "En Dash" character ("–") to no longer be within that range. Meaning, the LIKE predicate is looking for values, y, between 127 and x (where x >= 128), and y for the "En Dash" character is now > x. Whereas in VARCHAR, x = 255 and y = 150.
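To illustrate that precedence issue, the following sketch forces the comparison back into VARCHAR so that the accepted answer's pattern matches again. This assumes the database's default collation uses code page 1252 (as in the Fiddle), and please note the "best fit" mapping caveat discussed further down before relying on such a conversion in practice:

```sql
-- Illustration only: converting the NVARCHAR column to VARCHAR before the
-- comparison keeps everything in VARCHAR, so CHAR(150) keeps the value 150
-- and falls inside the 127 - 255 range. With the sample data, this should
-- return the "bad dash" row.
SELECT DataColumn AS [Bad Data]
FROM   sample_table
WHERE  CONVERT(VARCHAR(100), DataColumn)
         LIKE '%[' + CHAR(127) + '-' + CHAR(255) + ']%' COLLATE Latin1_General_100_BIN2;
```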
The quick fix to see that it does work is simply to change the NVARCHAR datatype of the DataColumn column to VARCHAR (yes, just remove the initial "N"), re-build the schema, and execute again; the LIKE query will then behave as expected.
The following might help explain why making the test column NVARCHAR caused the LIKE query to not match the row:
SELECT UNICODE(CHAR(127)) AS [CHAR(127)],
UNICODE(CHAR(150)) AS [CHAR(150)],
UNICODE(CHAR(255)) AS [CHAR(255)];
/*
CHAR(127) CHAR(150) CHAR(255)
127 8211 255
*/
As you can see in the results below the query, the "bad dash", which was CHAR(150), became NCHAR(8211) when stored in the NVARCHAR column. And, since that predicate is using a binary collation (generally the correct thing to do in this scenario), it was looking at the code points / values, not the characters. Hence, the LIKE clause was looking for characters with values between 127 and 255, and 8211 is typically not in that range ;-).
P.S. Please keep in mind that the function CHAR(150) can return different characters, or even NULL, based on the default collation of the database in which you execute that function. This is because VARCHAR data is based on code pages, those code pages are determined by the collation, and the collation used when executing the CHAR() function is the default collation of the active / current database. This affects values 128 - 255. Values 0 - 127 will always return the same characters, regardless of collation, as those are the standard ASCII character set and are the same across all of the code pages supported in SQL Server (though not in all code pages in general).
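For example, the following sketch shows the idea; the database names here are hypothetical placeholders, and the actual results depend on which databases and default collations exist on your instance:

```sql
-- Hypothetical illustration: CHAR() uses the default collation of the
-- current database, so the same call can yield different results.
USE DatabaseWith1252Collation;        -- e.g. default collation Latin1_General_100_CI_AS
SELECT CHAR(150);                     -- En Dash ("–") in code page 1252

USE DatabaseWithDoubleByteCollation;  -- e.g. default collation Japanese_XJIS_100_CI_AS
SELECT CHAR(150);                     -- may be a different character, or NULL
                                      -- if 150 has no single-byte mapping
```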
P.P.S. ALSO, I just noticed a slight difference in logic between the function and the query (i.e. the two answers from the linked question): CHAR(127) is considered good / valid in the Find_Invalid_Chars() function, yet it's considered bad / invalid in the LIKE query. If it were me, I would consider CHAR(127) valid, as it's part of the standard ASCII character set. But you need to decide how you want to treat it. Just be aware of that difference in case you do need to adjust the LIKE syntax a little.
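If you do want to match the function's logic and treat CHAR(127) as valid, one option (a sketch, reliable only against a VARCHAR column per the precedence issue above) is to start the "invalid" range at CHAR(128) instead:

```sql
-- Sketch: treats 0 - 127 (including DEL) as valid and 128 - 255 as invalid,
-- matching the Find_Invalid_Chars() function's treatment of CHAR(127).
SELECT DataColumn AS [Bad Data]
FROM   sample_table
WHERE  DataColumn LIKE '%[' + CHAR(128) + '-' + CHAR(255) + ']%'
       COLLATE Latin1_General_100_BIN2;
```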
Given:

- "The purpose of the query is to find characters in the source table/column that are not handled correctly by some software applications."

and:

- "The data can be in either VARCHAR or NVARCHAR."
I would say that:

- You don't want to convert NVARCHAR source data into VARCHAR, as there might be "best fit" mappings that translate invalid source characters into valid characters, but one or more of your software applications might not use "best fit" mappings.

SELECT NCHAR(178) AS [Unicode], -- Superscript 2 (U+00B2)
       CONVERT(VARCHAR(5), NCHAR(178) COLLATE SQL_Latin1_General_CP1_CI_AS) AS [CodePage-1252],
       CONVERT(VARCHAR(5), NCHAR(178) COLLATE Turkmen_100_CI_AS) AS [CodePage-1250];
/*
Unicode    CodePage-1252    CodePage-1250
²          ²                2
*/
- It will likely be more reliable to look for characters not in a specific "valid" range, as opposed to those in a specific invalid range, especially when dealing with NVARCHAR, which holds a lot more than 256 characters.
- You could get away with a single query if the "valid" range is always between values 0 and 127 (since those values are the same in both cases). But if you need to specify values above 127, then you will need one query for VARCHAR and one for NVARCHAR.
All that being said:

The following query returns rows containing at least one character that is not in the range of 0 - 127, for both VARCHAR and NVARCHAR. But, it only works with NVARCHAR columns for values above 127.

SELECT *
FROM (VALUES (NCHAR(178)),
             (NCHAR(8211)),
             (N''),
             (NULL),
             (N'xy' + NCHAR(165)),
             (N'AA'),
             (N'mM' + NCHAR(999) + N'Nn'),
             (N'#!~()')) tmp(TestValue)
WHERE tmp.[TestValue] LIKE N'%[^' + NCHAR(0) + N'-' + NCHAR(127) + N']%'
      COLLATE Latin1_General_100_BIN2;
/*
TestValue
²
–
xy¥
mMϧNn
*/
The following query also returns rows containing at least one character that is not in the range of 0 - 127, but only works for VARCHAR columns. However, it does allow using values between 128 and 255.

SELECT *
FROM (VALUES (CHAR(178)),
             (CHAR(150)),
             (''),
             (NULL),
             ('AA'),
             ('#!~()'),
             ('xy' + CONVERT(VARCHAR(5), NCHAR(165) COLLATE Latin1_General_100_BIN2)),
             ('mM' + CONVERT(VARCHAR(5), NCHAR(199) COLLATE Latin1_General_100_BIN2) + 'Nn')
     ) tmp(TestValue)
WHERE tmp.[TestValue] LIKE '%[^' + CHAR(0) + '-' + CHAR(127) + ']%'
      COLLATE Latin1_General_100_BIN2;
/*
TestValue
²
–
xy¥
mMÇNn
*/
Regarding:
The applications are interpreting the source correctly, but sometimes have issues with characters outside "standard" ranges, though the ranges vary by application.
- Not sure I understand how there can be "issues" with some characters if the app is interpreting the source data correctly, unless you meant that they "mostly" interpret the data correctly.
- The ranges varying by application sounds like this might require more detailed investigation than can be done in a simple Q & A format such as this. This behavior can be due to the apps using different drivers to connect (ODBC / OLEDB / etc.), what language they are written in, what assumptions they make about the data they receive, and so on. Some problems might be fixable with a configuration change (no code change) in the app, some might be fixable only with a code change, etc.