Question

I have made a straightforward implementation, conforming to the W3C specification, for checking whether a string is a legal XML element name. I simply hold the different sets of legal characters (legal start characters differ from the characters that may follow) in strings and search them with string.IndexOf. But the sets of legal characters are surprisingly (to me anyway) large, and checking the candidate string one character at a time becomes a tad expensive.

This isn't really an issue at the moment, as I need to validate a few strings once (taking milliseconds) per execution of a batch (taking seconds, minutes or even hours), but I'm curious to know what others will suggest.

Here's my straightforward implementation:

using System;
using System.Text;
using Project.Common; // Guard

namespace Project.Common.XmlUtilities
{
    static public class XmlUtil
    {
        static public bool IsLegalElementName(string localName)
        {
            Guard.ArgumentNotNull(localName, "localName");
            if (localName == "") 
                return false;

            if (NameStartChars.IndexOf(localName[0]) == -1)
                return false;

            for (int i = 1; i < localName.Length; i++)
                if (NameChars.IndexOf(localName[i]) == -1)
                    return false;

            return true;
        }


        // See W3 spec at http://www.w3.org/TR/REC-xml/#NT-NameStartChar.
        static public readonly string NameStartChars = AZ.ToLower() + AZ + ":_" + GetStringFromCharRanges(0xC0, 0xD6, 0xD8, 0xF6, 0xF8, 0x2FF, 0x370, 0x37D, 0x37F, 0x1FFF, 0x200C, 0x200D, 0x2070, 0x218F, 0x2C00, 0x2FEF, 0x3001, 0xD7FF, 0xF900, 0xFDCF, 0xFDF0, 0xFFFD, 0x10000, 0xEFFFF);

        // See W3 spec at http://www.w3.org/TR/REC-xml/#NT-NameChar.
        static public readonly string NameChars = NameStartChars + "-.0123456789" + char.ConvertFromUtf32(0xB7) + GetStringFromCharRanges(0x0300, 0x036F, 0x203F, 0x2040);

        public const string AZ = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

        // Hacky but convenient: alternating low-high unicode points specifies multiple ranges, e.g. 0-5 and 10-12 would be 0, 5, 10, 12.
        static string GetStringFromCharRanges(params int[] lowHigh)
        {
            var sb = new StringBuilder();
            for (int i = 0; i < lowHigh.Length; i += 2)
            {
                int low = lowHigh[i];
                int high = lowHigh[i + 1];
                for (int ci = low; ci <= high; ci++) // ranges in the spec are inclusive at both ends
                    sb.Append(char.ConvertFromUtf32(ci));
            }
            return sb.ToString();
        }
    }
}

Although I haven't bothered to build it, I reckon that creating a sorted list once, in a type initializer, and binary searching it (instead of searching linearly with string.IndexOf) to check each character would strike a good balance of space, time and complexity. But perhaps you have other (better!) ideas?
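
For illustration, here is a minimal sketch of that binary-search idea. It is not something I have benchmarked; the class name XmlNameRanges and the method names are made up for this example. It assumes the ranges are kept as two parallel arrays of inclusive bounds, sorted by lower bound, and it checks whole code points rather than individual UTF-16 chars, so a lone surrogate is rejected. Unlike the code above, it returns false for null instead of throwing.

```csharp
using System;

// Hypothetical helper: name and layout are assumptions for this sketch.
static class XmlNameRanges
{
    // Inclusive code point ranges of the NameStartChar production,
    // as parallel arrays sorted by lower bound.
    static readonly int[] StartLow  = { 0x3A, 0x41, 0x5F, 0x61, 0xC0, 0xD8, 0xF8, 0x370, 0x37F,
                                        0x200C, 0x2070, 0x2C00, 0x3001, 0xF900, 0xFDF0, 0x10000 };
    static readonly int[] StartHigh = { 0x3A, 0x5A, 0x5F, 0x7A, 0xD6, 0xF6, 0x2FF, 0x37D, 0x1FFF,
                                        0x200D, 0x218F, 0x2FEF, 0xD7FF, 0xFDCF, 0xFFFD, 0xEFFFF };

    // Ranges that NameChar allows in addition to NameStartChar.
    static readonly int[] ExtraLow  = { 0x2D, 0x30, 0xB7, 0x300, 0x203F };
    static readonly int[] ExtraHigh = { 0x2E, 0x39, 0xB7, 0x36F, 0x2040 };

    static bool InRanges(int[] low, int[] high, int cp)
    {
        int i = Array.BinarySearch(low, cp);
        if (i < 0) i = ~i - 1; // index of the last range whose lower bound is <= cp
        return i >= 0 && cp <= high[i];
    }

    public static bool IsLegalName(string name)
    {
        if (string.IsNullOrEmpty(name)) return false;

        bool first = true;
        for (int i = 0; i < name.Length; i++)
        {
            int cp = name[i];
            // Combine a valid surrogate pair into a single code point.
            if (char.IsHighSurrogate(name[i]) && i + 1 < name.Length
                && char.IsLowSurrogate(name[i + 1]))
            {
                cp = char.ConvertToUtf32(name[i], name[i + 1]);
                i++;
            }

            bool ok = InRanges(StartLow, StartHigh, cp)
                      || (!first && InRanges(ExtraLow, ExtraHigh, cp));
            if (!ok) return false;
            first = false;
        }
        return true;
    }
}
```

With only 16 + 5 ranges, each lookup takes a handful of comparisons, and nothing has to be built at startup beyond the two small tables.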


Solution

XmlConvert has a static string VerifyName(string name) method, but it throws an exception for invalid names.

I would still prefer to use this:

try
{
    XmlConvert.VerifyName(name);
    return true;
}
catch
{
    return false;
}

OTHER TIPS

I would go for a regex, or simply try to create an XElement with the name in question (if an exception is thrown, the name is invalid).
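
A rough sketch of the XElement variant, in case it helps. The class and method names are made up for this example. Two caveats, as far as I know: XElement takes an XName, which does not accept a colon in the local name, and which reads a leading "{...}" as a namespace, so this is not an exact match for the NameStartChar/NameChar rules in the question.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper: names are assumptions for this sketch.
static class XElementNameCheck
{
    public static bool IsLegalElementName(string localName)
    {
        try
        {
            // The string is implicitly converted to an XName, which validates it.
            var unused = new XElement(localName);
            return true;
        }
        catch (Exception)
        {
            // Invalid, empty or null names end up here.
            return false;
        }
    }
}
```

Like the try/catch around XmlConvert.VerifyName, this pays for an exception on every invalid name, which is fine for a few strings per batch but not for a hot loop.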

Licensed under: CC-BY-SA with attribution