CLP appends bytes to output

https://dba.stackexchange.com/questions/266763

db2
db2-luw

01-03-2021

Question

I am migrating a number of databases to UTF-8 and just discovered a phenomenon that I was not aware of. When selecting data out to the terminal, additional bytes are appended to the output. Example:

~]$ x=$(db2 -x "values 'a'")
~]$ echo "${x}b"
a b

One additional space after a

~]$ x=$(db2 -x "values 'aa'")
~]$ echo "${x}b"
aa  b

Two additional spaces after aa

It does not seem to matter how many bytes a character occupies in UTF-8:

~]$ x=$(db2 -x "values '𝄞'")
~]$ echo "${x}b"
𝄞 b

One additional space after g-clef

~]$ x=$(db2 -x "values '𝄞𝄞'")
~]$ echo "${x}b"
𝄞𝄞  b

Two additional spaces after g-clef g-clef

db cfg:

Database territory                                      = SE
Database code page                                      = 1208
Database code set                                       = UTF8
Database country/region code                            = 46
Database collating sequence                             = SYSTEM_819_SE

The terminal uses UTF-8 encoding (I tried terminator and gnome-terminal), and before connecting to the database I did:

export LC_CTYPE=sv_SE.utf8

The above are of course just silly examples, but I have a fair number of tests in scripts similar to this:

dbtype=`db2 -x "values nya.get_db_type()"`
if [ "${dbtype}" = "N" ]; then
    ...

where I need to change the test one way or another.

Any thoughts on a configuration that would get rid of the extra bytes?

~]$ uname -a
Linux nya-ladok3-release 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
~]$ db2level
DB21085I  This instance or install (instance name, where applicable: 
"db2inst1") uses "64" bits and DB2 code release "SQL11050" with level 
identifier "0601010F".
Informational tokens are "DB2 v11.5.0.0", "s1906101300", "DYN1906101300AMD64", 
and Fix Pack "0".
Product is installed at "/opt/ibm/db2/V11.5".

Solution

This is the explanation I got from IBM support. I have not tested it myself, but it seems reasonable. The suggested workaround will work for all situations that I can think of right now:

The extra spaces padded by CLP are expected behavior. This is because some multi-byte characters take up more than one physical space. The example below demonstrates this:

Take, for example, U+FF2D FULLWIDTH LATIN CAPITAL LETTER M:

$ db2 "select a, hex(a) from table(values('A' || u&'\FF2D' || 'B'),('A B')) t(a)"
A     2
----- ----------
AＭB 41EFBCAD42 --> Here wide-M is taking more space than a normal multi/single byte character.
A B  4120422020

  2 record(s) selected.
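The gap between characters and bytes in that first row can be checked without a database. The following is a minimal sketch using only standard tools (printf, od, wc); the variable names are my own. It builds the string 'A', U+FF2D, 'B' from its UTF-8 bytes and shows that the three characters occupy five bytes, matching the hex value CLP printed.

```shell
# 'A', U+FF2D (UTF-8 bytes EF BC AD, written as octal escapes), 'B'
s=$(printf 'A\357\274\255B')

# Hex dump of the raw bytes, with od's spacing and newline removed
hex=$(printf '%s' "$s" | od -An -tx1 | tr -d ' \n')
echo "$hex"      # 41efbcad42

# Byte count: three characters, but five bytes
bytes=$(printf '%s' "$s" | wc -c | tr -d ' ')
echo "$bytes"    # 5
```

On most terminals the fullwidth M is also drawn two columns wide, so character count, byte count, and display width all differ for this string.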

If you use a lot of functions like get_db_type() in your scripts, you can change the return type of these functions to use OCTETS:

Something like the following:

CREATE OR REPLACE FUNCTION get_db_type()
RETURNS VARCHAR(1 OCTETS)
DETERMINISTIC NO EXTERNAL ACTION CONTAINS SQL
BEGIN ATOMIC
    RETURN 'N';

END @

I'll leave my earlier workaround below as is:

If nothing else shows up I'll probably do something along the lines of:

x=$(db2 -x "values ('𝄞𝄞')")
x="${x%"${x##*[![:space:]]}"}"
echo "${x}b"
𝄞𝄞b
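For scripts with many such tests, the same parameter expansion could be wrapped in a small function. This is only a sketch: rtrim is a hypothetical helper name, and the padded value is hard-coded here as a stand-in for CLP output so the snippet runs without a database.

```shell
# rtrim: hypothetical helper that prints its first argument without
# trailing whitespace.
rtrim() {
    # ${1##*[![:space:]]} is the trailing run of whitespace (whatever follows
    # the last non-space character); stripping it as a suffix trims the value.
    printf '%s' "${1%"${1##*[![:space:]]}"}"
}

# Stand-in for: x=$(db2 -x "values 'aa'"), assumed to yield 'aa' plus padding
x='aa  '
x=$(rtrim "$x")
echo "${x}b"     # aab
```

Only trailing whitespace is removed, so a value with embedded spaces keeps them.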

At first, I figured:

x=$(db2 -x "values ('𝄞𝄞')" | xargs echo)

would do, but the pipe introduces a subshell, so:

echo ${x}b
SQL1024N A database connection does not exist. SQLSTATE=08003b

it does not work.
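One pipe-free alternative might be to let read do the trimming, since read strips leading and trailing IFS whitespace when given a single variable. The sketch below feeds it through a here-document instead of a pipeline. fake_db2 is a hypothetical stand-in that mimics padded CLP output, and I have not verified this against a live CLP connection, so treat it as an assumption that the command substitution behaves like the plain x=$(db2 ...) form above.

```shell
# fake_db2: hypothetical stand-in for db2 -x "values nya.get_db_type()",
# assumed to print the value followed by padding.
fake_db2() { printf 'N \n'; }

# read trims leading/trailing IFS whitespace; the here-document supplies
# the command output without putting the command in a pipeline.
read -r dbtype <<EOF
$(fake_db2)
EOF

if [ "${dbtype}" = "N" ]; then
    echo "match"
fi
```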

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange