Squeak Monticello character-encoding

Question 1

Squeak does use unicode (ISO 10646) internally for encoding characters in a String.
It might use extension like CP1252 for characters in range 16r80 to: 16r9F, but I'm not really sure anymore.

The characters codes are written as is on the stream source.st, and these codes are made of a single byte for a ByteString when all characters are <= 16rFF. In this case, the file should look like encoded in ISO-8859-L1 or CP1252.

If ever you have character codes > 16rFF, then a WideString is used in Squeak. Once again the codes are written as is on the stream source.st, but this time these are 32 bits codes (written in big-endian order). Technically, the encoding is thus UTF-32BE.

Now what does MczInstaller does? It uses the snapshot/source.st file, and uses setConverterForCode for reading this file, which is either UTF-8 or MacRoman... So non ASCII characters might get changed, and this is even worse in case of WideString which will be re-interpreted as ByteString.

MC itself doesn't use the snapshot/source.st member in the archive.
It rather uses the snapshot.bin (see code in MCMczReader, MCMczWriter).
This is a binary file whose format is governed by DataStream.

The snippet that you should use is rather:

MCMczReader loadVersionFile: 'YourPackage-b.18.mcz'

Question 2

Monticello isn't really aware of character encoding. I don't know the present situation in squeak but the last time I've looked into it there was an assumed character encoding of latin1. But that would mean it should work flawlessly in your situation.

It should work somehow anyway if you are writing and reading from the same kind of image. If the proper character encoding fails usually the internal byte representation is written from memory to disk. While this prevents any cross dialect exchange of packages it should work if using the same image kind.

Anyway there are things that should or could work but they often go wrong. So most projects try to avoid using non 7bit characters in their code. You don't need to convert non 7bit characters to HTML entities. You can use

Character value: 228

for producing an ä in your code without using non 7bit characters. On every character you like to add a conversion you can do

$ä asciiValue => 228

I know this is not the kind of answer some would want to get. But monticello is one of these things that still need to be adjusted for proper character encoding.