Replace Unicode numeral subscript or superscript with plain numeral

https://stackoverflow.com/questions/9503565

14-11-2019

Question

How do I replace a Unicode subscript or superscript numeral (e.g., ₂) with the corresponding plain numeral (i.e., 2) using regular expressions? I can of course replace each of them separately, but that takes ten lines of code...

I am implementing this in Perl but that should not really matter.

Solution

Here, from the unisupers script, is a Perl function that converts plain characters to Unicode superscripts:

sub convert_to_superscripts (_) {
   my $string = $_[0];
   $string =~ tr[+−=()0123456789AaÆᴂɐɑɒBbcɕDdðEeƎəɛɜɜfGgɡɣhHɦIiɪɨᵻɩjJʝɟKklLʟᶅɭMmɱNnɴɲɳŋOoɔᴖᴗɵȢPpɸrRɹɻʁsʂʃTtƫUuᴜᴝʉɥɯɰʊvVʋʌwWxyzʐʑʒꝯᴥβγδθφχнნʕⵡ]
                [⁺⁻⁼⁽⁾⁰¹²³⁴⁵⁶⁷⁸⁹ᴬᵃᴭᵆᵄᵅᶛᴮᵇᶜᶝᴰᵈᶞᴱᵉᴲᵊᵋᶟᵌᶠᴳᵍᶢˠʰᴴʱᴵⁱᶦᶤᶧᶥʲᴶᶨᶡᴷᵏˡᴸᶫᶪᶩᴹᵐᶬᴺⁿᶰᶮᶯᵑᴼᵒᵓᵔᵕᶱᴽᴾᵖᶲʳᴿʴʵʶˢᶳᶴᵀᵗᶵᵁᵘᶸᵙᶶᶣᵚᶭᶷᵛⱽᶹᶺʷᵂˣʸᶻᶼᶽᶾꝰᵜᵝᵞᵟᶿᵠᵡᵸჼˤⵯ];
   return $string;
}

And here, from the unisubs script, is the equivalent for subscripts:

sub convert_to_subscripts (_) {
   my $string = $_[0];
   $string =~ tr[+−=()0123456789aeəhijklmnoprstuvxβγρφχ]
                [₊₋₌₍₎₀₁₂₃₄₅₆₇₈₉ₐₑₔₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓᵦᵧᵨᵩᵪ];
   return $string;
}

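You just have to go the other way: swap the two tr lists, so that the superscript or subscript characters form the search list and the plain characters form the replacement list.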
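For the digits-only case asked about in the question, the reversed mapping might look like the sketch below. The function name plain_digits is made up for illustration and is not part of the unisupers or unisubs scripts. The code assumes the source file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # this file contains literal Unicode characters

# Hypothetical helper (not from unisupers/unisubs): map superscript and
# subscript digits back to plain ASCII digits with one transliteration.
sub plain_digits {
    my ($string) = @_;
    $string =~ tr[⁰¹²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉]
                 [01234567890123456789];
    return $string;
}

binmode STDOUT, ':encoding(UTF-8)';
print plain_digits("H₂O and x² + y³"), "\n";    # prints: H2O and x2 + y3
```

Because tr works character by character, only the characters listed are touched; everything else in the string is left alone.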

Another, simpler approach is to use the Unicode compatibility normalization forms (NFKD or NFKC), which replace superscript and subscript characters with their base characters. I haven’t checked whether these are exact inverses of the functions above. You can experiment with them using the nfkd and nfkc scripts.
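As a rough sketch of the normalization route, the core Unicode::Normalize module exports NFKC and NFKD functions. Note that compatibility normalization also rewrites other compatibility characters (ligatures, fullwidth forms, and so on), and that the superscript minus becomes U+2212 MINUS SIGN rather than an ASCII hyphen. The helper normalize_digits below is hypothetical: it uses a regular expression to limit normalization to the superscript and subscript digits.

```perl
use strict;
use warnings;
use utf8;    # this file contains literal Unicode characters
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: apply NFKC only to superscript/subscript digits,
# leaving every other character in the string untouched.
sub normalize_digits {
    my ($string) = @_;
    # U+00B2, U+00B3, U+00B9 and U+2070, U+2074..U+2079 are superscript
    # digits; U+2080..U+2089 are subscript digits.
    $string =~ s/([\x{00B2}\x{00B3}\x{00B9}\x{2070}\x{2074}-\x{2079}\x{2080}-\x{2089}])/NFKC($1)/ge;
    return $string;
}

binmode STDOUT, ':encoding(UTF-8)';
print NFKC("H₂O, x²"), "\n";               # whole-string normalization
print normalize_digits("ﬁx² + y₃"), "\n";  # the "ﬁ" ligature is kept as-is
```

Whole-string NFKC is the shortest option when any compatibility folding is acceptable; the regex-restricted variant is closer to what the question asks for.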

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow