Question

I have a PostgreSQL/PostGIS spatial database which contains Hebrew text columns. The system runs on Ubuntu, and everything works flawlessly with UTF-8.

I am trying to dump some tables to shapefiles for a Windows program which can only read Windows-1255 strings. Unfortunately, pgsql2shp has no encoding option (although shp2pgsql has one), so the Windows program reads the UTF-8 output as Windows-1255 and displays gibberish.

I have been trying to create a Windows-1255 view of the table columns, but found no way of doing it without corrupting the database.

Any ideas how to convert the tables?

Thanks,

Adam

UPDATE:

I thought this one was solved (see my own answer), but I still get random errors like:

ERROR:  character 0x9f of encoding "WIN1255" has no equivalent in "UTF8"

What I want is some kind of omit functionality, like iconv's -c flag, which simply skips source characters that have no equivalent in the target encoding.
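
For reference, this is the iconv behaviour I mean (the file names here are just placeholders):

# -c drops any input character that cannot be represented in the
# target encoding, instead of aborting with an error.
iconv -f UTF-8 -t WINDOWS-1255 -c dump.sql > dump-win1255.sql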


Solution

So what kind of strings does this Windows program read? If it really means ASCII, you can't possibly rescue the Hebrew characters: ASCII is only the 7-bit character set up to \x7F, and with Latin-1 you'll never get Hebrew either. More likely it's “the current system code page”, also (misleadingly but commonly) known in Windows as ‘ANSI’.

If that's the case, you will have to set the system code page to Hebrew (code page 1255) on every machine that runs the Windows program. I believe .shp files carry no character encoding information at all, so the shapefiles will only ever work correctly on machines with this code page set (it is the default only in the Israeli locale). (Apparently .dbf exports can have an accompanying .cpg file to specify the encoding, but I've no idea whether the program you're using supports that.)
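
If the program does honour .cpg files, creating one is trivial, but note that the exact string expected inside varies between tools (a bare code page number and an encoding name are both seen in the wild), so treat this as a sketch:

# Hypothetical: pair a .cpg file with the exported shapefile.
# Some readers expect "WINDOWS-1255", others just "1255".
echo "WINDOWS-1255" > export.cpg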

Then you'd have to export the data as code page 1255, or the nearest you're going to get in Postgres, ISO-8859-8. Since the export script doesn't seem to have any option other than taking the bytes directly from the database, you'd have to create a database in the ISO-8859-8 encoding and transfer all the data from the UTF-8 database to the 8859-8 one, either directly through queries or, perhaps more easily, using pg_dumpall, loading the SQL into Notepad and re-saving it as Hebrew instead of UTF-8 (adjusting any encoding settings listed in the SQL DDL as you go).
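
A minimal sketch of the database-to-database route (the database names are placeholders; the new database may need a C or Hebrew locale compatible with ISO-8859-8, and pg_dump will still abort on any character that has no 8859-8 equivalent):

# Create a target database whose server encoding is ISO-8859-8.
createdb -T template0 -E ISO_8859_8 hebrew_db
# Re-encode the dump on the fly while transferring the data.
pg_dump --encoding=ISO_8859_8 utf8_db | psql hebrew_db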

I wonder if the makers of the Windows program could be persuaded to support UTF-8? It's a bit sad to be stuck with code-page specific software in this century.

OTHER TIPS

From within the bash script:

# Let the user choose a client encoding, then export it so that
# libpq-based tools run later in the script inherit it.
select ENCODING in UTF8 WIN1252 WIN1255 ISO-8859-8
do
        if [[ -n $ENCODING ]]; then
                export PGCLIENTENCODING=$ENCODING
                break
        else
                echo 'Invalid encoding.'
        fi
done

The export PGCLIENTENCODING=$ENCODING statement does the trick: pgsql2shp has no encoding option of its own, but like any libpq client it honours the PGCLIENTENCODING environment variable, so the server converts text columns on the way out.
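
For example (the database and table names are placeholders), the whole export then becomes:

# With PGCLIENTENCODING set, the text columns are delivered to
# pgsql2shp already converted to Windows-1255.
export PGCLIENTENCODING=WIN1255
pgsql2shp -f export mydb mytable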

Checking the Hebrew encoding and code page tables, you can see that neither ISO-8859-8 nor Windows-1255 has a mapping for 0x9f.

The data you are trying to convert could be based on the older code page 862, a code page for Hebrew under DOS. Code page 862 maps the code 0x9f to the Unicode character LATIN SMALL LETTER F WITH HOOK, U+0192.
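
You can check that quickly with iconv (glibc knows this code page as CP862):

# Decode the offending byte as CP862: it comes out as U+0192 (ƒ).
printf '\x9f' | iconv -f CP862 -t UTF-8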

You can investigate similar "random" errors the same way, and then decide on a mapping for the non-Windows-1255 codes in your data.
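
For example (the table and column names are only placeholders), one possible cleanup before exporting, here simply dropping the stray character; in a UTF-8 database chr(402) is U+0192:

# Replace '' with a mapping of your own choosing if you would
# rather substitute a marker than delete the character outright.
psql -d mydb -c "UPDATE mytable SET name = replace(name, chr(402), '');"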

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow