Question

I am writing a web application that requires friendly URLs, but I'm not sure how to deal with characters outside 7-bit ASCII. I don't want to replace accented characters with URL-encoded entities either. Is there a C# method that allows this sort of conversion, or do I need to map out every single case I want to handle?


Solution

I don't know how to do it in C#, but the magic words you want are "Unicode decomposition". There's a standard way to break down composed characters like "é", and then you should be able to just filter out the non-ASCII ones.

Edit: this might be what you're looking for.
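For what it's worth, here is a minimal C# sketch of the decompose-then-filter idea described above. The class and method names (SlugHelper, RemoveDiacritics) are made up for illustration, and it assumes the input is well-formed Unicode, since String.Normalize throws on unpaired surrogates.

```csharp
using System;
using System.Text;

public static class SlugHelper
{
    // Hypothetical helper, not part of the original answer.
    public static string RemoveDiacritics(string input)
    {
        // Canonical decomposition: "é" becomes "e" followed by U+0301
        // (combining acute accent).
        string decomposed = input.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep only 7-bit ASCII. Combining marks are dropped, and so is
            // any character that has no ASCII base letter.
            if (c < 128)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Note that letters with no decomposition, such as "ø" or "ß", are silently dropped by this filter, so you may still want to map a few of them by hand first.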

OTHER TIPS

Use UTF-8:

"Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters." -- RFC 3986
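To make the quoted rule concrete, here is a small sketch using Uri.EscapeDataString from the .NET base class library, which encodes as UTF-8 and then percent-encodes each octet. The wrapper class and method names are only for illustration, and the result is exactly the kind of encoded URL the question wants to avoid displaying.

```csharp
using System;

public static class PercentEncodingDemo
{
    // Hypothetical wrapper around the framework call, for illustration only.
    public static string EncodeSegment(string segment)
    {
        // "é" (U+00E9) is the two octets C3 A9 in UTF-8,
        // so it comes out as "%C3%A9".
        return Uri.EscapeDataString(segment);
    }
}
```

For example, EncodeSegment("Montr\u00E9al") gives "Montr%C3%A9al".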

There is something similar on: URL Routing: Handling Spaces and Illegal Characters When Creating Friendly URLs

Nevertheless, I don't recommend automatic conversion. Some words change meaning when their accents are stripped, so you can turn a perfectly nice word into an inappropriate one.

This link might help: http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx

using System.Text; // StringBuilder and NormalizationForm

private string LatinToAscii(string InString)
{
    var newString = new StringBuilder(InString.Length);

    for (int i = 0; i < InString.Length; i++)
    {
        string original = InString.Substring(i, 1);

        // Half of a surrogate pair cannot be normalized on its own
        // (Normalize throws ArgumentException), so copy it unchanged.
        if (char.IsSurrogate(original, 0))
        {
            newString.Append(original);
            continue;
        }

        string charString = original.Normalize(NormalizationForm.FormKD);

        // If the character doesn't decompose, leave it as-is
        if (charString.Length == 1)
        {
            newString.Append(charString);
            continue;
        }

        int charsCopied = 0;
        foreach (char ch in charString)
        {
            // If the char is 7-bit ASCII, add it
            if (ch < 128)
            {
                newString.Append(ch);
                charsCopied++;
            }
        }

        /* If nothing in the decomposition was ASCII, give the character
         * back in its entirety, since we only mean to decompose
         * Latin chars.
         */
        if (charsCopied == 0)
            newString.Append(original);
    }

    return newString.ToString();
}

Ok -- there are some good answers here. Those methods would work. However, I have to question your basic premise. I presume that these values that you are discussing are basically to be querystring parameters, yes? That's the most common reason to have to filter out special characters.

For two or three years, I used a string encoding/decoding approach to pass stuff like this through querystring. There were always intermittent problems, because -- darn it -- there are just so many different possible special characters, and issues in one browser vs another, etc. Our methods weren't as sophisticated as those outlined here, but still. In 2005, during a rewrite of much of the system I was working on, we decided to move to only ever passing id values through querystring. That approach has worked extremely well, and I can't think of any drawbacks to it. If you have a database back-end, you already have an id attached to pretty much every string, anyway. If this is for searches or the like, you can always send it via form post -- or you can use an AJAX solution that doesn't require you to load another page in the first place.

Those methods aren't going to be the best for every situation -- there is no magic bullet here any more than anywhere else -- but this approach has been simple and very functional for me and my team, and so I think it's something for you to at least consider.

Well, there's an easy way, I think: there aren't that many of these characters, so you can replace them in the string very easily using the Replace() method of the string class.
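A rough sketch of what that could look like is below. The class name, method name, and the mapping table are all invented for illustration, and the table is deliberately incomplete; you would extend it with whatever characters your data actually contains.

```csharp
using System.Collections.Generic;

public static class ManualReplacer
{
    // Illustrative, incomplete mapping (an assumption, not a complete list).
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>
        {
            { "\u00E9", "e" },  // é
            { "\u00E8", "e" },  // è
            { "\u00E0", "a" },  // à
            { "\u00F8", "o" },  // ø
            { "\u00DF", "ss" }, // ß
        };

    public static string ReplaceAccents(string input)
    {
        // string.Replace(string, string) does an ordinal, case-sensitive
        // replacement, so uppercase variants need their own entries.
        foreach (var pair in Map)
            input = input.Replace(pair.Key, pair.Value);
        return input;
    }
}
```

One thing a hand-written table can do is handle characters like "ß" and "ø" that have no Unicode decomposition, at the cost of having to maintain the list yourself.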

http://Montréal.com

(copy/paste in browser, it works?)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow