Huge string replace in JavaScript?

https://stackoverflow.com/questions/18222665

24-06-2022

Question

I've got a small JavaScript application that parses files the user drops into the browser. Recently I discovered an issue with some non-English characters. The files dropped here use the Windows-1252 character set, so characters such as ñ actually come through as Ã±, and I must convert them all to the proper characters.

For example, I get SeÃ±or which should be Señor in Spanish.

I've found an extremely useful website with a collection of these characters and the counterparts I need to convert them to.

I've condensed that down into two JavaScript arrays:

var toReplace = ["Ã€", "Ã", "Ã‚", "Ãƒ", "Ã„", "Ã…", "Ã†", "Ã‡", "Ãˆ", "Ã‰", "ÃŠ", "Ã‹", "ÃŒ", "Ã", "ÃŽ", "Ã", "Ã", "Ã‘", "Ã’", "Ã“", "Ã”", "Ã•", "Ã–", "Ã—", "Ã˜", "Ã™", "Ãš", "Ã›", "Ãœ", "Ã", "Ãž", "ÃŸ", "Ã", "Ã¡", "Ã¢", "Ã£", "Ã¤", "Ã¥", "Ã¦", "Ã§", "Ã¨", "Ã©", "Ãª", "Ã«", "Ã¬", "Ã", "Ã®", "Ã¯", "Ã°", "Ã±", "Ã²", "Ã³", "Ã´", "Ãµ", "Ã¶", "Ã·", "Ã¸", "Ã¹", "Ãº", "Ã»", "Ã¼", "Ã½", "Ã¾", "Ã¿"];
var replaceWith = ["À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"];

What would be the most efficient way to replace every string from toReplace that appears in a paragraph with its counterpart (same index) in replaceWith?

I'm hoping this won't be too loop-heavy since it's not uncommon to drop over 100 files into this application that already does some heavy looping & parsing.

Perhaps there is a better way to do this instead of keeping these characters in arrays?

EDIT - I just realized I might need to replace with the Unicode equivalent instead. Here's an array of the Unicode escapes in the same order:

var unicodeReplaceWith= ["\u00C0", "\u00C1", "\u00C2", "\u00C3", "\u00C4", "\u00C5", "\u00C6", "\u00C7", "\u00C8", "\u00C9", "\u00CA", "\u00CB", "\u00CC", "\u00CD", "\u00CE", "\u00CF", "\u00D0", "\u00D1", "\u00D2", "\u00D3", "\u00D4", "\u00D5", "\u00D6", "\u00D7", "\u00D8", "\u00D9", "\u00DA", "\u00DB", "\u00DC", "\u00DD", "\u00DE", "\u00DF", "\u00E0", "\u00E1", "\u00E2", "\u00E3", "\u00E4", "\u00E5", "\u00E6", "\u00E7", "\u00E8", "\u00E9", "\u00EA", "\u00EB", "\u00EC", "\u00ED", "\u00EE", "\u00EF", "\u00F0", "\u00F1", "\u00F2", "\u00F3", "\u00F4", "\u00F5", "\u00F6", "\u00F7", "\u00F8", "\u00F9", "\u00FA", "\u00FB", "\u00FC", "\u00FD", "\u00FE", "\u00FF"];

Solution

I don't know much about speed in JavaScript, or why this can't be configured correctly on the server, but here's one way to do it.

Interactive Demo

First we turn everything into an object, so we can look up translations.

var map = {};
for (var i=0; i<toReplace.length; i++) {
  map[toReplace[i]] = replaceWith[i];
}
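One thing to watch for when building the map: object keys are unique, so if two entries in toReplace are the same string, the later one silently overwrites the earlier one. Several entries in the array as pasted in the question look like a bare "Ã" (the second character may be invisible or lost), so it's worth checking. A minimal sketch with a hypothetical two-entry table (demoKeys, demoValues and demoMap are made-up names for illustration):

// Hypothetical table where both keys are the same string.
var demoKeys = ["Ã", "Ã"];
var demoValues = ["\u00C1", "\u00CD"]; // Á, Í

var demoMap = {};
for (var i = 0; i < demoKeys.length; i++) {
  demoMap[demoKeys[i]] = demoValues[i];
}

var keyCount = Object.keys(demoMap).length; // 1, not 2
var winner = demoMap["Ã"];                  // "\u00CD" (Í): the last assignment wins

If keyCount is smaller than toReplace.length for the real table, some keys have collided and need fixing before the replacement can be trusted.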

Then we join our keys into a regular expression. Note that the keys must be sorted longest-first (the demo includes the sorting code).

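var sortedKeys = toReplace.slice().sort(function(a, b) { return b.length - a.length; }); // longest first
var expression = new RegExp(sortedKeys.join("|"), "g");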
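The longest-first requirement comes from how alternation works: the regex engine tries the alternatives left to right and takes the first one that matches, so a short key listed before a longer key with the same prefix will always win. Here's a small sketch of the difference, using two made-up keys rather than the full table:

// Same two alternatives, in two different orders.
var shortFirst = new RegExp(["Ã", "Ã±"].join("|"), "g");
var longFirst = new RegExp(["Ã±", "Ã"].join("|"), "g");

var shortFirstMatches = "SeÃ±or".match(shortFirst); // ["Ã"]  (the ± is left behind)
var longFirstMatches = "SeÃ±or".match(longFirst);   // ["Ã±"] (the whole pair is matched)

// Sorting a copy by descending length puts the longer keys first.
var orderedKeys = ["Ã", "Ã±"].slice().sort(function (a, b) {
  return b.length - a.length;
});

None of the keys in the question contain regex metacharacters, so joining them unescaped is fine here; a table with characters like "." or "|" would need escaping first.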

In the replace function, we can substitute each match with its translation. This is as simple as looking it up in our map.

function doReplace(source) {
  return source.replace(expression, function(m) {
    return map[m];
  });
}

var result = doReplace("SeÃ±or");
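Since the question mentions 100+ files, note that the map and the expression only need to be built once; after that each file costs a single replace call. Below is a rough end-to-end sketch of that, using a three-entry subset of the table. The names sampleKeys, sampleValues, fixText and fileContents are made up for illustration, and the file contents are assumed to already be available as strings:

// Illustrative subset of the question's table; the real arrays are much longer.
var sampleKeys = ["Ã±", "Ã©", "Ã¼"];
var sampleValues = ["\u00F1", "\u00E9", "\u00FC"]; // ñ, é, ü

// Build the lookup object and the expression once, up front.
var sampleMap = {};
for (var i = 0; i < sampleKeys.length; i++) {
  sampleMap[sampleKeys[i]] = sampleValues[i];
}
var sampleExpression = new RegExp(
  sampleKeys.slice().sort(function (a, b) { return b.length - a.length; }).join("|"),
  "g"
);

// One pass over the text per call; no loop over the table.
function fixText(source) {
  return source.replace(sampleExpression, function (m) {
    return sampleMap[m];
  });
}

// Hypothetical contents of three dropped files.
var fileContents = ["SeÃ±or", "cafÃ©", "MÃ¼ller"];
var fixedContents = fileContents.map(fixText);
// fixedContents is ["Señor", "café", "Müller"]

I haven't benchmarked this, but it avoids running one replace per table entry per file, which is the loop-heavy approach the question was worried about.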

Licensed under: CC-BY-SA with attribution
