Question

I need urgent help comparing two strings. A string written to table1 in my database is UTF-8 but still looks strange: ＳＡＤＩ. A string written to table2 in the same database is SADI, which looks normal. Whenever I compare the two, the result is false.

  1. Any idea how the comparison can be made? (It should actually return true.)

  2. Any idea how I can insert ＳＡＤＩ as SADI into the database?

Either would be a solution, hopefully.


Solution

In your strings, SADI is a standard ASCII string, but ＳＡＤＩ uses full-width Unicode characters.

For example, Ｓ is U+FF33 'FULLWIDTH LATIN CAPITAL LETTER S' (UTF-8: 0xEF 0xBC 0xB3),

but S is standard ASCII U+0053 'LATIN CAPITAL LETTER S' (UTF-8: 0x53).

The other characters are likewise full-width forms: they look like standard Latin letters, but they are different code points, so a plain string comparison fails.
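To see the difference for yourself, you can dump the code points of both strings. This is only a minimal sketch in Python (not one of the languages used in this answer); the helper name `describe` is made up for illustration, and the full-width string is written with escapes so it survives copy-paste.

```python
import unicodedata

def describe(text):
    """Return (char, code point, Unicode name, UTF-8 bytes as hex) per character."""
    return [
        (
            ch,
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            ch.encode("utf-8").hex(),
        )
        for ch in text
    ]

# Full-width S, A, D, I (assumed to be what table1 contains)
fullwidth = "\uFF33\uFF21\uFF24\uFF29"
plain = "SADI"

for row in describe(fullwidth) + describe(plain):
    print(*row)
```

Each full-width letter should show up as a three-byte UTF-8 sequence with a FULLWIDTH name, while the ASCII letters are a single byte each.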

How did they get there - that's a good question. Probably somebody got really creative and copy-pasted something from Word? Who knows.

You can convert these strange characters back to normal ones by applying Unicode Normalization Form KC (NFKC). Normalize the string before inserting it, or normalize both strings before comparing them. This Perl script works as a filter (it accepts UTF-8 and outputs normalized UTF-8):

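use strict;
use warnings;
use Unicode::Normalize;
# Decode input and encode output as strict UTF-8
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
# NFKC folds compatibility characters (such as full-width letters) to their plain equivalents
while (<STDIN>) { print NFKC($_); }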

In PHP:

$result = Normalizer::normalize( $str, Normalizer::FORM_KC );

This requires the intl extension.
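For the comparison part of the question, the same idea can be sketched in Python: normalize both sides with NFKC and then compare. The names `nfkc` and `same_text` are hypothetical helpers, and the two sample values stand in for what the two tables are assumed to hold.

```python
import unicodedata

def nfkc(text):
    """Fold compatibility characters (e.g. full-width letters) to plain forms."""
    return unicodedata.normalize("NFKC", text)

def same_text(a, b):
    """Compare two strings after NFKC normalization."""
    return nfkc(a) == nfkc(b)

stored = "\uFF33\uFF21\uFF24\uFF29"  # full-width S, A, D, I (assumed table1 value)
plain = "SADI"                       # ASCII (assumed table2 value)

print(stored == plain)           # False: different code points
print(same_text(stored, plain))  # True: equal once normalized
print(nfkc(stored))              # SADI, safe to insert as plain ASCII
```

Normalizing once at insert time is usually simpler than normalizing on every comparison, but note that NFKC also folds other compatibility characters (ligatures, superscripts and so on), so check that this is acceptable for your data.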

Licensed under: CC-BY-SA with attribution