Question

I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm.

How can I do this? I need the source code (using loops) or the JavaScript code. (should not be dependent on any platform/framework/library)

Edit: I understand that the ASCII representation will not look correct and will be larger (in terms of bytes) than its UTF-8 counterpart, since it's an encoded form of the UTF-8 original.


Solution

You could use an ASCII-only version of the quote function from Douglas Crockford's json2.js, which would look like this:

    var escapable = /[\\\"\x00-\x1f\x7f-\uffff]/g,
        meta = {    // table of character substitutions
            '\b': '\\b',
            '\t': '\\t',
            '\n': '\\n',
            '\f': '\\f',
            '\r': '\\r',
            '"' : '\\"',
            '\\': '\\\\'
        };

    function quote(string) {

// If the string contains no control characters, no quote characters, and no
// backslash characters, then we can safely slap some quotes around it.
// Otherwise we must also replace the offending characters with safe escape
// sequences.

        escapable.lastIndex = 0;
        return escapable.test(string) ?
            '"' + string.replace(escapable, function (a) {
                var c = meta[a];
                return typeof c === 'string' ? c :
                    '\\u' + ('0000' + a.charCodeAt(0).toString(16)).slice(-4);
            }) + '"' :
            '"' + string + '"';
    }

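This produces a valid, ASCII-only, JavaScript-quoted version of the input string.

For example, quote("Doppelgänger!") returns "Doppelg\u00e4nger!" (including the surrounding double quotes).

To reverse the encoding, parse the result with JSON.parse (or eval it, if the input is trusted):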

var encoded = quote("Doppelgänger!");
var back = JSON.parse(encoded); // eval(encoded);

OTHER TIPS

Any UTF-8 string that is reversibly convertible to ASCII is already ASCII.

UTF-8 can represent any unicode character - ASCII cannot.

As others have said, you can't convert UTF-8 text/plain into ASCII text/plain without dropping data.

You could convert UTF-8 text/plain into ASCII someother/format. For instance, HTML lets any character in UTF-8 be represented in an ASCII data file using character references.

If we continue with that example, in JavaScript, charCodeAt could help with converting a string to a representation of it using HTML character references.
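A minimal sketch of that idea follows. The function names toHtmlCharRefs and fromHtmlCharRefs are made up for illustration. It uses codePointAt rather than charCodeAt so that characters outside the Basic Multilingual Plane become a single reference, and it also escapes "&" so that decoding is unambiguous. It only handles decimal references that it produced itself, not arbitrary HTML.

```javascript
// Encode: every non-ASCII character (and '&') becomes a decimal character reference.
function toHtmlCharRefs(str) {
    var out = '';
    for (var i = 0; i < str.length; i++) {
        var code = str.codePointAt(i);
        if (code > 0xffff) {
            i++; // the code point used two UTF-16 units; skip the second one
        }
        out += (code > 127 || code === 38) ? '&#' + code + ';' : String.fromCodePoint(code);
    }
    return out;
}

// Decode: turn each decimal reference back into its character.
function fromHtmlCharRefs(str) {
    return str.replace(/&#(\d+);/g, function (match, num) {
        return String.fromCodePoint(Number(num));
    });
}

var encoded = toHtmlCharRefs('Doppelgänger!'); // "Doppelg&#228;nger!"
var decoded = fromHtmlCharRefs(encoded);       // "Doppelgänger!"
```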

Another approach is taken by URLs, and implemented in JS as encodeURIComponent.
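As a short illustration, encodeURIComponent percent-encodes the UTF-8 bytes of every non-ASCII character, and decodeURIComponent reverses it. Both are built into the language. Note that encodeURIComponent throws a URIError if the string contains a lone surrogate.

```javascript
var original = 'Doppelgänger!';

// "ä" is the two UTF-8 bytes C3 A4, so it becomes "%C3%A4".
var encoded = encodeURIComponent(original); // "Doppelg%C3%A4nger!"

// decodeURIComponent restores the original string.
var decoded = decodeURIComponent(encoded);  // "Doppelgänger!"
```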

Your requirement is pretty strange.

Converting UTF-8 into ASCII would lose all information about Unicode code points > 127 (i.e. everything that's not in ASCII).

You could, however, try to encode your Unicode data (no matter what the source encoding) in an ASCII-compatible encoding, such as UTF-7. The resulting data could legally be interpreted as ASCII, even though it is really UTF-7.

If the string is encoded as UTF-8, it's not a string any more. It's binary data, and if you want to represent the binary data as ASCII, you have to format it into a string that can be represented using the limited ASCII character set.

One way is to use base-64 encoding (example in C#):

string original = "asdf";
// encode the string into UTF-8 data:
byte[] encodedUtf8 = Encoding.UTF8.GetBytes(original);
// format the data into base-64:
string base64 = Convert.ToBase64String(encodedUtf8);

If you want the string encoded as ASCII data:

// encode the base-64 string into ASCII data:
byte[] encodedAscii = Encoding.ASCII.GetBytes(base64);
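Since the question asks for JavaScript, here is a rough equivalent sketch. It assumes btoa and atob are available; they are provided by browsers and recent Node.js versions but are host functions, not part of the language itself. The function names utf8ToBase64 and base64ToUtf8 are made up for illustration.

```javascript
// Encode: string -> UTF-8 bytes -> base-64 (pure ASCII output).
function utf8ToBase64(str) {
    // encodeURIComponent emits the UTF-8 bytes as %XX escapes;
    // turn each escape into a single character holding that byte value.
    var bytes = encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
    return btoa(bytes);
}

// Decode: base-64 -> UTF-8 bytes -> string.
function base64ToUtf8(base64) {
    var bytes = atob(base64);
    var escaped = '';
    for (var i = 0; i < bytes.length; i++) {
        // Re-create a %XX escape for every byte and let decodeURIComponent parse the UTF-8.
        escaped += '%' + ('00' + bytes.charCodeAt(i).toString(16)).slice(-2);
    }
    return decodeURIComponent(escaped);
}

var base64 = utf8ToBase64('asdf');   // "YXNkZg=="
var back = base64ToUtf8(base64);     // "asdf"
```

Base-64 output is roughly a third larger than the UTF-8 input, regardless of how much of the text was ASCII to begin with.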

Do you want to strip all non-ASCII characters (or replace them with '?', etc.), or do you want to store Unicode code points in a non-Unicode system?

The first can be done in a loop that checks for character codes greater than 127 and replaces them.
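A minimal sketch of such a loop is shown below. The name replaceNonAscii is made up for illustration. This is lossy and therefore not reversible, and a character outside the Basic Multilingual Plane yields two replacement characters because JavaScript strings are indexed by UTF-16 code unit.

```javascript
// Replace every code unit above 127 with a placeholder (default '?').
function replaceNonAscii(str, replacement) {
    if (replacement === undefined) {
        replacement = '?';
    }
    var out = '';
    for (var i = 0; i < str.length; i++) {
        out += str.charCodeAt(i) > 127 ? replacement : str.charAt(i);
    }
    return out;
}

var result = replaceNonAscii('Doppelgänger!'); // "Doppelg?nger!"
```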

If you don't want to use "any platform/framework/library" then you will need to write your own encoder. Otherwise I'd just use jQuery's .html().

It is impossible to convert a UTF-8 string into ASCII, but it is possible to encode Unicode as an ASCII-compatible string.

Probably you want to use Punycode - this is already a standard Unicode encoding that encodes all Unicode characters into ASCII. For JavaScript code check this question

Please edit your question title and description to prevent others from down-voting it: do not use the term "conversion", use "encoding".

Here is a function to convert UTF-8 accented characters (àéèî etc.) to extended-ASCII codes. If there is an accented character in the string, it is converted to a "%" followed by its code (for example, é becomes %130). Then, on the other side, I parse the string, so I know where there is an accent and what the ASCII character is.

I used it in a JavaScript application to send data to a microcontroller that works in ASCII.

var convertUtf8ToAscii = function (str) {
    var asciiStr = "";
    var refTable = { // Reference table: Unicode code point -> extended ASCII code
        199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
        239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
    };
    for (var i = 0; i < str.length; i++) {
        var ascii = refTable[str.charCodeAt(i)];
        if (ascii !== undefined) {
            // Known accented character: emit "%" plus its extended ASCII code
            asciiStr += "%" + ascii;
        } else {
            // Anything else (including characters missing from the table) is copied unchanged
            asciiStr += str[i];
        }
    }
    return asciiStr;
};
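The receiving side is not shown in the answer. If the decoding were done in JavaScript, it could look like the sketch below; the name convertAsciiToUtf8 is made up for illustration. It assumes the input never contains a literal "%" followed by three digits that happen to be in the table, since the encoder above does not escape "%" itself.

```javascript
// Same table as the encoder: Unicode code point -> extended ASCII code.
var refTable = {
    199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
    239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
};

// Build the inverse table: extended ASCII code -> Unicode code point.
var reverseTable = {};
for (var code in refTable) {
    if (Object.prototype.hasOwnProperty.call(refTable, code)) {
        reverseTable[refTable[code]] = Number(code);
    }
}

// Replace each "%NNN" with the accented character it stands for.
// Sequences that are not in the table are left untouched.
function convertAsciiToUtf8(asciiStr) {
    return asciiStr.replace(/%(\d{3})/g, function (match, num) {
        var unicode = reverseTable[num];
        return unicode !== undefined ? String.fromCharCode(unicode) : match;
    });
}

var decoded = convertAsciiToUtf8('caf%130'); // "café"
```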

An implementation of the quote() function might do what you want. My version can be found here

You can use eval() to reverse the encoding:

var foo = 'Hägar';
var quotedFoo = quote(foo);
var unquotedFoo = eval(quotedFoo);
alert(foo === unquotedFoo);
Licensed under: CC-BY-SA with attribution