Question

I want to use Rebol 3 to read a file in Latin1 and convert it to UTF-8. Is there a built-in function I can use, or some external library? Where can I find it?


Solution

Rebol has an invalid-utf? function that scours a binary value for a byte that is not part of a valid UTF-8 sequence. We can just loop until we've found and replaced all of them, then convert our binary value to a string:

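latin1-to-utf8: function [binary [binary!]][
    mark: :binary
    ; invalid-utf? returns the series positioned at the first bad byte, or none
    while [mark: invalid-utf? mark][
        ; replace that single byte with the UTF-8 encoding of the same codepoint
        change/part mark to char! mark/1 1
    ]
    to string! binary
]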

This function modifies the original binary. We can create a new string instead that leaves the binary value intact:

latin1-to-utf8: function [binary [binary!]][
    mark: :binary
    to string! rejoin collect [
        while [mark: invalid-utf? binary][
            keep copy/part binary mark  ; keeps the portion up to the bad byte
            keep to char! mark/1        ; converts the bad byte to good bytes
            binary: next mark           ; set the series beyond the bad byte
        ]
        keep binary                     ; keep whatever is remaining
    ]
]
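
One caveat with both versions: they only rewrite bytes that invalid-utf? flags, so a run of Latin-1 bytes that happens to form a valid UTF-8 sequence passes through unchanged. A minimal sketch of that case, assuming invalid-utf? returns none for input that is entirely valid UTF-8; the sample bytes are made up for illustration:

```rebol
; In Latin-1, the bytes C3 A9 spell the two characters "Ã©".
; The same two bytes are also the valid UTF-8 encoding of "é".
sample: #{C3A9}

; Nothing is flagged, so neither function above would touch these bytes
probe invalid-utf? sample

; Decoding therefore yields one character, not the two Latin-1 ones
probe length? to string! sample
```

If the input may contain such sequences, converting every byte unconditionally avoids the ambiguity.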

Bonus: here's a wee Rebmu version of the above—rebmu/args snippet #{DECAFBAD} where snippet is:

; modifying
IUgetLOAD"invalid-utf?"MaWT[MiuM][MisMtcTKm]tsA

; copying
IUgetLOAD"invalid-utf?"MaTSrjCT[wt[MiuA][kp copy/partAmKPtcFm AnxM]kpA]

OTHER TIPS

Here's a version that should be a bit faster and, at the very least, use less memory.

latin1-to-utf8: func [
    "Transcodes a Latin-1 encoded string to UTF-8"
    bin [binary!] "Bytes of Latin-1 data"
] [
    to binary! head collect/into [
        foreach b bin [
            keep to char! b
        ]
    ] make string! length? bin
]

It takes advantage of Latin-1 characters having the same numeric values as the corresponding Unicode codepoints. If you wanted to convert from another character set for which that isn't the case, you could do a calculation on b to remap the characters before converting.
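
As a rough sketch of that remapping idea, here is what it might look like for Windows-1252, which differs from Latin-1 in the byte range 0x80 to 0x9F. The function name cp1252-to-utf8 is hypothetical, and only three of the remapped bytes are listed, so treat this as an illustration rather than a complete converter:

```rebol
cp1252-to-utf8: func [
    "Transcodes Windows-1252 bytes to UTF-8 (partial mapping, illustration only)"
    bin [binary!] "Bytes of Windows-1252 data"
    /local out cp
][
    out: make string! length? bin
    foreach b bin [
        ; Remap the bytes whose codepoint differs from the byte value;
        ; every other byte is used as the codepoint directly, as in Latin-1.
        cp: switch/default b [
            128 [8364]   ; 0x80 -> U+20AC euro sign
            133 [8230]   ; 0x85 -> U+2026 horizontal ellipsis
            153 [8482]   ; 0x99 -> U+2122 trade mark sign
        ][b]
        append out to char! cp
    ]
    to binary! out
]

probe cp1252-to-utf8 #{80}
```

A full version would need an entry for every byte in the 0x80 to 0x9F range that Windows-1252 assigns.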

It uses less memory and is faster for a variety of reasons:

  • Normally, collect creates a block. We use collect/into and pass it a string as a target. Strings use less memory than blocks of integers or characters.
  • We preallocate the string to the length of the input data, which saves on reallocations.
  • We let Rebol's native code convert the characters rather than doing our own math.
  • There's less code in the loop, so it should run faster.

This method still loads the file into memory all at once, and still generates an intermediate value to store the results, but at least the intermediate value is smaller. Maybe this will let you process larger files.

If the reason you need it to be UTF-8 is that you need to process the file as a string in Rebol, just skip the to binary! and return the string as-is. Alternatively, you can process the binary source data directly, converting each byte with to char! as you go.
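
To tie this back to the original question about files, here is a minimal sketch of the string-returning variant used on a file. The helper name latin1-to-string and both file names are placeholders, and the sample file is written first only so that the snippet is self-contained:

```rebol
latin1-to-string: func [
    "Decodes Latin-1 bytes into a Rebol string (hypothetical helper)"
    bin [binary!] "Bytes of Latin-1 data"
    /local out
][
    out: make string! length? bin
    foreach b bin [append out to char! b]
    out
]

; Create a small Latin-1 file: "café" with E9 as the final byte
write %latin1-sample.txt #{636166E9}

; read returns a binary! in Rebol 3, so the bytes arrive undecoded
text: latin1-to-string read %latin1-sample.txt

; Work with text as an ordinary string here, then encode it as UTF-8 on output
write %utf8-sample.txt to binary! text
```

Keeping the data as a string until it is written means the UTF-8 encoding step happens only once, at the output boundary.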

Nothing built in at the moment, sorry. Here's a straightforward implementation of Latin-1 to UTF-8 conversion which I wrote and used with Rebol 3 a while back:

latin1-to-utf8: func [
    "Transcodes a Latin-1 encoded string to UTF-8"
    bin [binary!] "Bytes of Latin-1 data"
] [
    to-binary collect [foreach b bin [keep to-char b]]
] 

Note: this code is optimised for legibility, and not in any way for performance. (From a performance perspective, it's outright stupid. You have been warned.)

Update: Incorporated @BrianH's neat "Latin-1 byte values correspond to Unicode codepoints" optimisation, which makes the above collapse to a one-liner (and mildly less stupid at the same time). Still, for a version that is better optimised for memory usage, see @BrianH's nice answer.

latin1-to-utf8: func [
    "Transcodes bin as a Latin-1 encoded string to UTF-8"
    bin [binary!] "Bytes of Latin-1 data"
    /local t
] [
    t: make string! length? bin
    foreach b bin [append t to char! b ]
    t
]
Licensed under: CC-BY-SA with attribution