Question
I am migrating a number of databases to UTF-8 and just discovered a phenomenon that I was not aware of. When selecting data out to the terminal, additional bytes are added to the output. Example:
~]$ x=$(db2 -x "values 'a'")
~]$ echo "${x}b"
a b
One additional space after a
~]$ x=$(db2 -x "values 'aa'")
~]$ echo "${x}b"
aa b
Two additional spaces after aa
It does not seem to matter how many bytes a character occupies in UTF-8:
~]$ x=$(db2 -x "values '𝄞'")
~]$ echo "${x}b"
𝄞 b
One additional space after g-clef
~]$ x=$(db2 -x "values '𝄞𝄞'")
~]$ echo "${x}b"
𝄞𝄞 b
Two additional spaces after g-clef g-clef
db cfg:
Database territory = SE
Database code page = 1208
Database code set = UTF8
Database country/region code = 46
Database collating sequence = SYSTEM_819_SE
The terminal encoding is UTF-8 (tried terminator and gnome-terminal), and before connecting to the database I ran:
export LC_CTYPE=sv_SE.utf8
The above are of course just silly examples, but I have a fair number of tests in scripts similar to this:
dbtype=`db2 -x "values nya.get_db_type()"`
if [ "${dbtype}" = "N" ]; then
...
where I need to change the test one way or another.
Any thoughts on a configuration that would get rid of the extra bytes?
~]$ uname -a
Linux nya-ladok3-release 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
~]$ db2level
DB21085I This instance or install (instance name, where applicable:
"db2inst1") uses "64" bits and DB2 code release "SQL11050" with level
identifier "0601010F".
Informational tokens are "DB2 v11.5.0.0", "s1906101300", "DYN1906101300AMD64",
and Fix Pack "0".
Product is installed at "/opt/ibm/db2/V11.5".
Solution
This is the explanation I got from IBM support. I have not tested it myself, but it seems reasonable, and the suggested workaround will work for all situations that I can think of right now:
The extra space padding in the CLP is expected behavior, because some multi-byte characters occupy more than one column on screen. The example below demonstrates this.
Take U+FF2D FULLWIDTH LATIN CAPITAL LETTER M:
$ db2 "select a, hex(a) from table(values('A' || u&'\FF2D' || 'B'),('A B')) t(a)"
A     2
----- ----------
AＭB   41EFBCAD42
A B   4120422020
  2 record(s) selected.
Here the wide M (U+FF2D) takes more screen space than an ordinary single-byte or multi-byte character.
If you are using a lot of functions like get_db_type() in your scripts, you can change the return type of these functions to use OCTETS string units.
Something like the following:
CREATE OR REPLACE FUNCTION get_db_type()
RETURNS VARCHAR(1 OCTETS)
DETERMINISTIC NO EXTERNAL ACTION CONTAINS SQL
BEGIN ATOMIC
RETURN 'N';
END @
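As background for the OCTETS suggestion: OCTETS string units measure length in bytes rather than in characters. The sketch below only illustrates the byte counts involved and does not touch the database; `byte_len` is a hypothetical helper name, and the characters are written as octal escapes so the snippet does not depend on the encoding of the script file.

```shell
# byte_len (hypothetical helper): print the number of bytes in the argument.
byte_len() {
    # wc -c counts bytes; $(( )) strips any padding wc adds to its output.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

byte_len 'N'                               # prints 1
byte_len "$(printf '\357\274\255')"        # U+FF2D (EF BC AD), prints 3
byte_len "$(printf '\360\235\204\236')"    # U+1D11E g-clef, prints 4
```

A one-byte value such as 'N' therefore fits in VARCHAR(1 OCTETS), while the g-clef from the question would need four.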
I'll leave the current workaround as is below:
If nothing else shows up I'll probably do something along the lines of:
x=$(db2 -x "values ('𝄞𝄞')")
x="${x%"${x##*[![:space:]]}"}"
echo "${x}b"
𝄞𝄞b
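The parameter expansion above can be wrapped in a small helper so that every test in a script trims the same way. This is a minimal sketch: the name `rtrim` is hypothetical, and the commented usage lines assume that a database connection already exists in the calling shell.

```shell
# rtrim (hypothetical helper): print the argument without trailing whitespace.
rtrim() {
    # ${1##*[![:space:]]} expands to the trailing run of whitespace;
    # removing that run as a suffix leaves the value without padding.
    printf '%s' "${1%"${1##*[![:space:]]}"}"
}

# Usage with the test from the question (needs a live connection):
#   dbtype=$(rtrim "$(db2 -x "values nya.get_db_type()")")
#   if [ "${dbtype}" = "N" ]; then ...

echo "$(rtrim 'aa  ')b"    # prints: aab
```

Because db2 is still the only command inside its own $(...), it is called in the same way as in the examples that work.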
At first, I figured:
x=$(db2 -x "values ('𝄞𝄞')" | xargs echo)
would do, but the pipe makes db2 run in a subshell, which does not see the existing connection, so:
echo ${x}b
SQL1024N A database connection does not exist. SQLSTATE=08003b
it does not work.
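If an external tool is still preferred for the trimming, one option is to capture the output first, with db2 as the only command inside $(...), and post-process the variable in a second step. A sketch under that assumption: `fake_db2` is a hypothetical stand-in that mimics the padded CLP output so the snippet runs without a database, and sed is just one possible choice of tool.

```shell
# fake_db2 (hypothetical stand-in): mimics CLP output with trailing padding.
fake_db2() { printf '%s\n' 'aa  '; }

# Step 1: capture the output with no pipe inside the command substitution.
x=$(fake_db2)    # in a real script: x=$(db2 -x "values 'aa'")

# Step 2: strip trailing whitespace; only printf and sed run in this pipeline.
x=$(printf '%s' "$x" | sed 's/[[:space:]]*$//')

echo "${x}b"     # prints: aab
```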