How to convert large UTF-8 strings into ASCII?
-
08-07-2019 - |
Question
I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm.
How can I do this? I need the source code (using loops) or the JavaScript code. (should not be dependent on any platform/framework/library)
Edit: I understand that the ASCII representation will not look correct and would be larger (in terms of bytes) than its UTF-8 counterpart, since its an encoded form of the UTF-8 original.
Solution
You could use an ASCII-only version of Douglas Crockford's json2.js quote function. Which would look like this:
var escapable = /[\\\"\x00-\x1f\x7f-\uffff]/g,
meta = { // table of character substitutions
'\b': '\\b',
'\t': '\\t',
'\n': '\\n',
'\f': '\\f',
'\r': '\\r',
'"' : '\\"',
'\\': '\\\\'
};
function quote(string) {
// If the string contains no control characters, no quote characters, and no
// backslash characters, then we can safely slap some quotes around it.
// Otherwise we must also replace the offending characters with safe escape
// sequences.
escapable.lastIndex = 0;
return escapable.test(string) ?
'"' + string.replace(escapable, function (a) {
var c = meta[a];
return typeof c === 'string' ? c :
'\\u' + ('0000' + a.charCodeAt(0).toString(16)).slice(-4);
}) + '"' :
'"' + string + '"';
}
This will produce a valid ASCII-only, javascript-quoted of the input string
e.g. quote("Doppelgänger!")
will be "Doppelg\u00e4nger!"
To revert the encoding you can just eval the result
var encoded = quote("Doppelgänger!");
var back = JSON.parse(encoded); // eval(encoded);
OTHER TIPS
Any UTF-8 string that is reversibly convertible to ASCII is already ASCII.
UTF-8 can represent any unicode character - ASCII cannot.
As others have said, you can't convert UTF-8 text/plain into ASCII text/plain without dropping data.
You could convert UTF-8 text/plain into ASCII someother/format. For instance, HTML lets any character in UTF-8 be representing in an ASCII data file using character references.
If we continue with that example, in JavaScript, charCodeAt could help with converting a string to a representation of it using HTML character references.
Another approach is taken by URLs, and implemented in JS as encodeURIComponent.
Your requirement is pretty strange.
Converting UTF-8 into ASCII would loose all information about Unicode codepoints > 127 (i.e. everything that's not in ASCII).
You could, however try to encode your Unicode data (no matter what source encoding) in an ASCII-compatible encoding, such as UTF-7. This would mean that the data that is produced could legally be interpreted as ASCII, but it is really UTF-7.
If the string is encoded as UTF-8, it's not a string any more. It's binary data, and if you want to represent the binary data as ASCII, you have to format it into a string that can be represented using the limited ASCII character set.
One way is to use base-64 encoding (example in C#):
string original = "asdf";
// encode the string into UTF-8 data:
byte[] encodedUtf8 = Encoding.UTF8.GetBytes(original);
// format the data into base-64:
string base64 = Convert.ToBase64String(encodedUtf8);
If you want the string encoded as ASCII data:
// encode the base-64 string into ASCII data:
byte[] encodedAscii = Encoding.ASCII.GetBytes(base64);
Do you want to strip all non ascii chars (slash replace them with '?', etc) or to store Unicode code points in a non unicode system?
First can be done in a loop checking for values > 128 and replacing them.
If you don't want to use "any platform/framework/library" then you will need to write your own encoder. Otherwise I'd just use JQuery's .html();
It is impossible to convert an UTF-8 string into ASCII but it is possible to encode Unicode as an ASCII compatible string.
Probably you want to use Punycode - this is already a standard Unicode encoding that encodes all Unicode characters into ASCII. For JavaScript code check this question
Please edit you question title and description in order to prevent others from down-voting it - do not use term conversion, use encoding.
Here is a function to convert UTF8 accents to ASCII Accents (àéèî etc) If there is an accent in the string it's converted to %239 for exemple Then on the other side, I parse the string and I know when there is an accent and what is the ASCII char.
I used it in a javascript software to send data to a microcontroller that works in ASCII.
convertUtf8ToAscii = function (str) {
var asciiStr = "";
var refTable = { // Reference table Unicode vs ASCII
199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
};
for(var i = 0; i < str.length; i++){
var ascii = refTable[str.charCodeAt(i)];
if (ascii != undefined)
asciiStr += "%" +ascii;
else
asciiStr += str[i];
}
return asciiStr;
}
An implementation of the quote()
function might do what you want.
My version can be found here
You can use eval()
to reverse the encoding:
var foo = 'Hägar';
var quotedFoo = quote(foo);
var unquotedFoo = eval(quotedFoo);
alert(foo === unquotedFoo);