Query to find rows containing ASCII characters in a given range
16-02-2021
Question
I am using some scripts from another topic, but the accepted answer isn't working for all my data scenarios. I would have asked my question on the original How to check for Non-Ascii Characters post, but I don't have enough reputation to comment or up-vote yet.
Questions:
- Why is the accepted answer not working for char(150)?
- Is there a way to fix the accepted answer?
My Testing
I created SQL Fiddle with sample data, the stored procedure from one of the answers, and queries to demonstrate the issue.
Query 1: sample_table
-- Note: The "bad dash" row has char(150)
SELECT * FROM sample_table;
+-------------------+
| DataColumn |
+-------------------+
| test - good dash |
| test – bad dash |
+-------------------+
Query 2: The other answer by John shows the "bad dash" row containing char(150):
SELECT dbo.Find_Invalid_Chars(DataColumn) [Invalid Characters]
FROM sample_table
WHERE dbo.Find_Invalid_Chars(DataColumn) IS NOT NULL;
+----------------------+
| Invalid Characters |
+----------------------+
| test [150] bad dash |
+----------------------+
Query 3: The accepted answer by Martin Smith returns no results:
SELECT DataColumn AS [Bad Data]
FROM sample_table
WHERE DataColumn LIKE '%[' + CHAR(127)+ '-' +CHAR(255)+']%' COLLATE Latin1_General_100_BIN2;
+------------+
| [Bad Data] |
+------------+
-- No rows returned.
Conclusion
Unfortunately, I often need to find characters within (or outside of) a range in databases I can't create stored procedures in. I'd really like to find a fix for the accepted answer or a simple script that wouldn't require creation of any objects (including temp tables).
Any suggestions? Thanks in advance.
EDIT 1: The solution cannot modify or add any objects or settings in the database. I'm looking for a self-contained query that will select rows with one or more characters in a range between two CHAR() numbers, regardless of the ASCII or Extended ASCII number supplied.
EDIT 2: The DataColumn can be either VARCHAR or NVARCHAR. I have no control over this, so I'm hoping to figure out a self-contained query that works for both. The purpose of the query is to find characters in the source table/column that are not handled correctly by some software applications. The applications are interpreting the source correctly, but sometimes have issues with characters outside "standard" ranges, though the ranges vary by application.
Solution
Why is the accepted answer not working for char(150)?
Actually, it does. The problem is that your test is invalid: your test column, DataColumn, is using NVARCHAR instead of VARCHAR. The character itself works in both datatypes, but the behavior is different due to how it is being used in each case:
- In the Find_Invalid_Chars() function (i.e. the "other" answer), the string is being converted back into VARCHAR, since that is the datatype of the input parameter for that function. In this case it works as expected (although I believe it can be done much more efficiently than that loop, but that's for another time ;-).
- In the LIKE query (i.e. the "accepted" answer), the expanded and concatenated result of '%[' + CHAR(127) + '-' + CHAR(255) + ']%' is actually converted into NVARCHAR, since that is the datatype of the column it is being compared to (and NVARCHAR has a higher datatype precedence). Hence that LIKE predicate is not behaving as expected: either the CHAR(255) character maps to a different code point, and/or the CHAR(150) character in the column itself maps to a different code point (the CHAR(127) character does not change as it's in the standard ASCII range). In either case, the conversion to NVARCHAR is causing the numerical value of the "En Dash" character ("–") to no longer be within that range. Meaning, the LIKE predicate is looking for values, y, between 127 and x (where x >= 128), and y for the "En Dash" character is now > x. Whereas in VARCHAR, x = 255 and y = 150.
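To illustrate that precedence issue, the following sketch forces the comparison back into VARCHAR so that the accepted answer's pattern matches again. This assumes the database's default collation uses code page 1252 (as in the Fiddle), and please note the "best fit" mapping caveat discussed further down before relying on such a conversion in practice:

```sql
-- Illustration only: converting the NVARCHAR column to VARCHAR before the
-- comparison keeps everything in VARCHAR, so CHAR(150) keeps the value 150
-- and falls inside the 127 - 255 range. With the sample data, this should
-- return the "bad dash" row.
SELECT DataColumn AS [Bad Data]
FROM   sample_table
WHERE  CONVERT(VARCHAR(100), DataColumn)
         LIKE '%[' + CHAR(127) + '-' + CHAR(255) + ']%' COLLATE Latin1_General_100_BIN2;
```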
The quick fix to see that it does work is simply to change the NVARCHAR datatype of the DataColumn column to VARCHAR (yes, just remove the initial "N"), re-build the schema, and execute again; the LIKE query will then behave as expected.
The following might help explain why making the test column NVARCHAR caused the LIKE query to not match the row:
SELECT UNICODE(CHAR(127)) AS [CHAR(127)],
UNICODE(CHAR(150)) AS [CHAR(150)],
UNICODE(CHAR(255)) AS [CHAR(255)];
/*
CHAR(127) CHAR(150) CHAR(255)
127 8211 255
*/
As you can see in the results below the query, the "bad dash", which was CHAR(150), became NCHAR(8211) when stored in the NVARCHAR column. And, since that predicate is using a binary collation (generally the correct thing to do in this scenario), it was looking at the code points / values, not the characters. Hence, the LIKE clause was looking for characters with values between 127 and 255, and 8211 is typically not in that range ;-).
P.S. Please keep in mind that the function CHAR(150) can return different characters, or even NULL, based on the default collation of the database in which you execute that function. This is because VARCHAR data is based on code pages, those code pages are determined by the collation, and the collation used when executing the CHAR() function is the default collation of the active / current database. This affects values 128 - 255. Values 0 - 127 will always return the same characters, regardless of collation, as those are the standard ASCII character set and are the same across all of the code pages supported in SQL Server (though not in all code pages in general).
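For example, the following sketch shows the idea; the database names here are hypothetical placeholders, and the actual results depend on which databases and default collations exist on your instance:

```sql
-- Hypothetical illustration: CHAR() uses the default collation of the
-- current database, so the same call can yield different results.
USE DatabaseWith1252Collation;        -- e.g. default collation Latin1_General_100_CI_AS
SELECT CHAR(150);                     -- En Dash ("–") in code page 1252

USE DatabaseWithDoubleByteCollation;  -- e.g. default collation Japanese_XJIS_100_CI_AS
SELECT CHAR(150);                     -- may be a different character, or NULL
                                      -- if 150 has no single-byte mapping
```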
P.P.S. ALSO, I just noticed a slight difference in logic between the function and the query (i.e. the two answers from the linked question): CHAR(127) is considered good / valid in the Find_Invalid_Chars() function, yet it's considered bad / invalid in the LIKE query. If it were me, I would consider CHAR(127) valid, as it's part of the standard ASCII character set. But you need to decide how you want to treat it. Just be aware of that difference in case you do need to adjust the LIKE syntax a little.
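If you do want to match the function's logic and treat CHAR(127) as valid, one option (a sketch, reliable only against a VARCHAR column per the precedence issue above) is to start the "invalid" range at CHAR(128) instead:

```sql
-- Sketch: treats 0 - 127 (including DEL) as valid and 128 - 255 as invalid,
-- matching the Find_Invalid_Chars() function's treatment of CHAR(127).
SELECT DataColumn AS [Bad Data]
FROM   sample_table
WHERE  DataColumn LIKE '%[' + CHAR(128) + '-' + CHAR(255) + ']%'
       COLLATE Latin1_General_100_BIN2;
```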
Given:

- "The purpose of the query is to find characters in the source table/column that are not handled correctly by some software applications."

and:

- "The data can be in either VARCHAR or NVARCHAR."
I would say that:

- You don't want to convert NVARCHAR source data into VARCHAR, as there might be "best fit" mappings that translate invalid source characters into valid characters, but one or more of your software applications might not use "best fit" mappings.

SELECT NCHAR(178) AS [Unicode], -- Superscript 2 (U+00B2)
       CONVERT(VARCHAR(5), NCHAR(178) COLLATE SQL_Latin1_General_CP1_CI_AS) AS [CodePage-1252],
       CONVERT(VARCHAR(5), NCHAR(178) COLLATE Turkmen_100_CI_AS) AS [CodePage-1250];
/*
Unicode    CodePage-1252    CodePage-1250
²          ²                2
*/
- It will likely be more reliable to look for characters not in a specific "valid" range, as opposed to those in a specific invalid range, especially when dealing with NVARCHAR, which holds a lot more than 256 characters.
- You could get away with a single query if the "valid" range is always between values 0 and 127 (since those values are the same in both cases). But if you need to specify values above 127, then you will need one query for VARCHAR and one for NVARCHAR.
All that being said:

The following query returns rows containing at least one character that is not in the range of 0 - 127, for both VARCHAR and NVARCHAR. But, it only works with NVARCHAR columns for values above 127.

SELECT *
FROM (VALUES (NCHAR(178)),
             (NCHAR(8211)),
             (N''),
             (NULL),
             (N'xy' + NCHAR(165)),
             (N'AA'),
             (N'mM' + NCHAR(999) + N'Nn'),
             (N'#!~()')) tmp(TestValue)
WHERE tmp.[TestValue] LIKE N'%[^' + NCHAR(0) + N'-' + NCHAR(127) + N']%'
      COLLATE Latin1_General_100_BIN2;
/*
TestValue
²
–
xy¥
mMϧNn
*/
The following query also returns rows containing at least one character that is not in the range of 0 - 127, but only works for VARCHAR columns. However, it does allow using values between 128 and 255.

SELECT *
FROM (VALUES (CHAR(178)),
             (CHAR(150)),
             (''),
             (NULL),
             ('AA'),
             ('#!~()'),
             ('xy' + CONVERT(VARCHAR(5), NCHAR(165) COLLATE Latin1_General_100_BIN2)),
             ('mM' + CONVERT(VARCHAR(5), NCHAR(199) COLLATE Latin1_General_100_BIN2) + 'Nn')
     ) tmp(TestValue)
WHERE tmp.[TestValue] LIKE '%[^' + CHAR(0) + '-' + CHAR(127) + ']%'
      COLLATE Latin1_General_100_BIN2;
/*
TestValue
²
–
xy¥
mMÇNn
*/
Regarding:
The applications are interpreting the source correctly, but sometimes have issues with characters outside "standard" ranges, though the ranges vary by application.
- Not sure I understand how there can be "issues" with some characters if the app is interpreting the source data correctly, unless you meant that they "mostly" interpret the data correctly.
- The ranges varying by application sounds like this might require more detailed investigation than can be done in a simple Q & A format such as this. This behavior can be due to the apps using different drivers to connect (ODBC / OLEDB / etc.), what language they are written in, what assumptions they make about the data they receive, and so on. Some problems might be fixable with a configuration change (no code change) in the app, some might be fixable only with a code change, etc.