How to remove accents and tilde in a C++ std::string

https://stackoverflow.com/questions/144761

02-07-2019
|

Question

I have a problem with a string in C++ which has several words in Spanish. This means that I have a lot of words with accents and tildes. I want to replace them for their not accented counterparts. Example: I want to replace this word: "había" for habia. I tried replace it directly but with replace method of string class but I could not get that to work.

I'm using this code:

for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find_first_of(strMine);
    while (found!=std::string::npos)
    {
        strAux=(it->second);
        toReplace.erase(found,strMine.length());
        toReplace.insert(found,strAux);
        found=toReplace.find_first_of(strMine,found+1);
    }
}

Where dictionary is a map like this (with more entries):

dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );

and toReplace strings is:

std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";

I obviously must be missing something. I can't figure it out. Is there any library I can use?.

Thanks,

Solution

First, this is a really bad idea: you’re mangling somebody’s language by removing letters. Although the extra dots in words like “naïve” seem superfluous to people who only speak English, there are literally thousands of writing systems in the world in which such distinctions are very important. Writing software to mutilate someone’s speech puts you squarely on the wrong side of the tension between using computers as means to broaden the realm of human expression vs. tools of oppression.

What is the reason you’re trying to do this? Is something further down the line choking on the accents? Many people would love to help you solve that.

That said, libicu can do this for you. Open the transform demo; copy and paste your Spanish text into the “Input” box; enter

NFD; [:M:] remove; NFC

as “Compound 1” and click transform.

(With help from slide 9 of Unicode Transforms in ICU. Slides 29-30 show how to use the API.)

OTHER TIPS

I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)

Now, the best algorithm is hinted at the approved answer: Use NKD (decomposition) to decompose accented letters into the base letter and a seperate accent, and then remove all accents.

There is little point in the re-composition afterwards, though. You removed most sequences which would change, and the others are for all intents and purposes identical anyway. WHat's the difference between æ in NKC and æ in NKD?

I definitely think you should look into the root of the problem. That is, look for a solution that will allow you to support characters encoded in Unicode or for the user's locale.

That being said, your problem is that you're dealing with multi-character strings. There is std::wstring but I'm not sure I'd use that. For one thing, wide characters aren't meant to handle variable width encodings. This hole goes deep, so I'll leave it at that.

Now, as for the rest of your code, it is error prone because you mix the looping logic with translation logic. Thus, at least two kinds of bugs can occur: translation bugs and looping bugs. Do use the STL, it can help you a lot with the looping part.

The following is a rough solution for replacing characters in a string.

main.cpp:

#include <iostream>
#include <string>
#include <iterator>
#include <algorithm>
#include "translate_characters.h"

using namespace std;

int main()
{
    string text;
    cin.unsetf(ios::skipws);
    transform(istream_iterator<char>(cin), istream_iterator<char>(),
              inserter(text, text.end()), translate_characters());
    cout << text << endl;
    return 0;
}

translate_characters.h:

#ifndef TRANSLATE_CHARACTERS_H
#define TRANSLATE_CHARACTERS_H

#include <functional>
#include <map>

class translate_characters : public std::unary_function<const char,char> {
public:
    translate_characters();
    char operator()(const char c);

private:
    std::map<char, char> characters_map;
};

#endif // TRANSLATE_CHARACTERS_H

translate_characters.cpp:

#include "translate_characters.h"

using namespace std;

translate_characters::translate_characters()
{
    characters_map.insert(make_pair('e', 'a'));
}

char translate_characters::operator()(const char c)
{
    map<char, char>::const_iterator translation_pos(characters_map.find(c));
    if( translation_pos == characters_map.end() )
        return c;
    return translation_pos->second;
}

You might want to check out the boost (http://www.boost.org/) library.

It has a regexp library, which you could use. In addition it has a specific library that has some functions for string manipulation (link) including replace.

Try using std::wstring instead of std::string. UTF-16 should work (as opposed to ASCII).

If you can (if you're running Unix), I suggest using the tr facility for this: it's custom-built for this purpose. Remember, no code == no buggy code. :-)

Edit: Sorry, you're right, tr doesn't seem to work. How about sed? It's a pretty stupid script I've written, but it works for me.

#!/bin/sed -f
s/á/a/g;
s/é/e/g;
s/í/i/g;
s/ó/o/g;
s/ú/u/g;
s/ñ/n/g;

I could not link the ICU libraries but I still think it's the best solution. As I need this program to be functional as soon as possible I made a little program (that I have to improve) and I'm going to use that. Thank you all for for suggestions and answers.

Here's the code I'm gonna use:

for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find(strMine);
    while (found != std::string::npos)
    {
        strAux=(it->second);
        toReplace.erase(found,2);
        toReplace.insert(found,strAux);
        found=toReplace.find(strMine,found+1);
    }
}

I will change it next time I have to turn my program in for correction (in about 6 weeks).

    /// <summary>
    /// 
    /// Replace any accent and foreign character by their ASCII equivalent.
    /// In other words, convert a string to an ASCII-complient string.
    /// 
    /// This also get rid of special hidden character, like EOF, NUL, TAB and other '\0', except \n\r
    /// 
    /// Tests with accents and foreign characters:
    /// Before: "äæǽaeöœoeüueÄAeÜUeÖOeÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶАAàáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặаaБBбbÇĆĈĊČCçćĉċčcДDдdÐĎĐΔDjðďđδdjÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭEèéêëēĕėęěέεẽẻẹềếễểệеэeФFфfĜĞĠĢΓГҐGĝğġģγгґgĤĦHĥħhÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫIìíîïĩīĭǐįıηήίιϊỉịиыїiĴJĵjĶΚКKķκкkĹĻĽĿŁΛЛLĺļľŀłλлlМMмmÑŃŅŇΝНNñńņňŉνнnÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢОOòóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợоoПPпpŔŖŘΡРRŕŗřρрrŚŜŞȘŠΣСSśŝşșšſσςсsȚŢŤŦτТTțţťŧтtÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУUùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựуuÝŸŶΥΎΫỲỸỶỴЙYýÿŷỳỹỷỵйyВVвvŴWŵwŹŻŽΖЗZźżžζзzÆǼAEßssĲIJĳijŒOEƒf'ξksπpβvμmψpsЁYoёyoЄYeєyeЇYiЖZhжzhХKhхkhЦTsцtsЧChчchШShшshЩShchщshchЪъЬьЮYuюyuЯYaяya"
    /// After:  "aaeooeuueAAeUUeOOeAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaaaaaaaaaaaaBbCCCCCCccccccDdDDjddjEEEEEEEEEEEEEEEEEEeeeeeeeeeeeeeeeeeeFfGGGGGgggggHHhhIIIIIIIIIIIIIiiiiiiiiiiiiJJjjKKkkLLLLllllMmNNNNNnnnnnOOOOOOOOOOOOOOOOOOOOOOooooooooooooooooooooooPpRRRRrrrrSSSSSSssssssTTTTttttUUUUUUUUUUUUUUUUUUUUUUUUuuuuuuuuuuuuuuuuuuuuuuuYYYYYYYYyyyyyyyyVvWWwwZZZZzzzzAEssIJijOEf'kspvmpsYoyoYeyeYiZhzhKhkhTstsChchShshShchshchYuyuYaya"
    /// 
    /// Tests with invalid 'special hidden characters':
    /// Before: "\0\0\000\0000Bj��rk�\'\"\\\0\a\b\f\n\r\t\v\u0020���oacu\'\\\'te�"
    /// After:  "00000Bjrk'\"\\\n\r oacu'\\'te"
    /// 
    /// </summary>
    private string Normalize(string StringToClean)
    {
        string normalizedString = StringToClean.Normalize(NormalizationForm.FormD);
        StringBuilder Buffer = new StringBuilder(StringToClean.Length);

        for (int i = 0; i < normalizedString.Length; i++)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(normalizedString[i]) != UnicodeCategory.NonSpacingMark)
            {
                Buffer.Append(normalizedString[i]);
            }
        }

        string PreAsciiCompliant = Buffer.ToString().Normalize(NormalizationForm.FormC);
        StringBuilder AsciiComplient = new StringBuilder(PreAsciiCompliant.Length);

        foreach (char character in PreAsciiCompliant)
        {
            //Reject all special characters except \n\r (Carriage-Return and Line-Feed). 
            //Get rid of special hidden character, like EOF, NUL, TAB and other '\0'
            if (((int)character >= 32 && (int)character < 127) || ((int)character == 10 || (int)character == 13)) 
            {
                AsciiComplient.Append(character);
            }
        }
        return AsciiComplient.ToString().Trim(); // Remove spaces at start and end of string if any
    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow