Melhor maneira de detectar XML?

https://stackoverflow.com/questions/891223

23-08-2019
|

Pergunta

Atualmente, eu tenho o seguinte código c # para extrair um valor fora do texto. Se seu XML, eu quero que o valor dentro dele - caso contrário, se não for XML, ele pode simplesmente devolver o texto em si

String data = "..."
try
{
    return XElement.Parse(data).Value;
}
catch (System.Xml.XmlException)
{
    return data;
}

Eu sei exceções são caros em C #, então eu queria saber se havia uma maneira melhor para determinar se o texto que eu estou lidando com é xml ou não?

Eu pensei de testes regex, mas eu não' ver isso como uma alternativa mais barata. Note, eu estou pedindo um menos caro método de fazer isso.

Solução

Você poderia fazer uma verificação preliminar para uma

(Free-mão escrita.)

// Has to have length to be XML
if (!string.IsNullOrEmpty(data))
{
    // If it starts with a < after trimming then it probably is XML
    // Need to do an empty check again in case the string is all white space.
    var trimmedData = data.TrimStart();
    if (string.IsNullOrEmpty(trimmedData))
    {
       return data;
    }

    if (trimmedData[0] == '<')
    {
        try
        {
            return XElement.Parse(data).Value;
        }
        catch (System.Xml.XmlException)
        {
            return data;
        }
    }
}
else
{
    return data;
}

Eu tinha originalmente o uso de um regex, mas Trim () [0] é idêntico ao que que regex faria.

Outras dicas

O código dado abaixo irá corresponder a todos os seguintes formatos XML:

<text />                             
<text/>                              
<text   />                           
<text>xml data1</text>               
<text attr='2'>data2</text>");
<text attr='2' attr='4' >data3 </text>
<text attr>data4</text>              
<text attr1 attr2>data5</text>

E aqui está o código:

public class XmlExpresssion
{
    // EXPLANATION OF EXPRESSION
    // <        :   \<{1}
    // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
    // >        :   >{1}
    // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data      
    // </       :   <{1}/{1}
    // text     :   \k<xmlTag>
    // >        :   >{1}
    // (\w|\W)* :   Matches attributes if any

    // Sample match and pattern egs
    // Just to show how I incrementally made the patterns so that the final pattern is well-understood
    // <text>data</text>
    // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

    //<text />
    // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

    //<text>data</text> or <text />
    // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
    // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    private const string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    // Checks if the string is in xml format
    private static bool IsXml(string value)
    {
        return Regex.IsMatch(value, XML_PATTERN);
    }

    /// <summary>
    /// Assigns the element value to result if the string is xml
    /// </summary>
    /// <returns>true if success, false otherwise</returns>
    public static bool TryParse(string s, out string result)
    {
        if (XmlExpresssion.IsXml(s))
        {
            Regex r = new Regex(XML_PATTERN, RegexOptions.Compiled);
            result = r.Match(s).Result("${data}");
            return true;
        }
        else
        {
            result = null;
            return false;
        }
    }

}

código de chamada:

if (!XmlExpresssion.TryParse(s, out result)) 
    result = s;
Console.WriteLine(result);

Update: (post original está abaixo) Colin tem a brilhante idéia de mover o exterior regex instanciação das chamadas de modo que eles estão criados apenas uma vez. Heres o novo programa:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    delegate String xmltestFunc(String data);

    class Program
    {
        static readonly int iterations = 1000000;

        private static void benchmark(xmltestFunc func, String data, String expectedResult)
        {
            if (!func(data).Equals(expectedResult))
            {
                Console.WriteLine(data + ": fail");
                return;
            }
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i)
                func(data);
            sw.Stop();
            Console.WriteLine(data + ": " + (float)((float)sw.ElapsedMilliseconds / 1000));
        }

        static void Main(string[] args)
        {
            benchmark(xmltest1, "<tag>base</tag>", "base");
            benchmark(xmltest1, " <tag>base</tag> ", "base");
            benchmark(xmltest1, "base", "base");
            benchmark(xmltest2, "<tag>ColinBurnett</tag>", "ColinBurnett");
            benchmark(xmltest2, " <tag>ColinBurnett</tag> ", "ColinBurnett");
            benchmark(xmltest2, "ColinBurnett", "ColinBurnett");
            benchmark(xmltest3, "<tag>Si</tag>", "Si");
            benchmark(xmltest3, " <tag>Si</tag> ", "Si" );
            benchmark(xmltest3, "Si", "Si");
            benchmark(xmltest4, "<tag>RashmiPandit</tag>", "RashmiPandit");
            benchmark(xmltest4, " <tag>RashmiPandit</tag> ", "RashmiPandit");
            benchmark(xmltest4, "RashmiPandit", "RashmiPandit");
            benchmark(xmltest5, "<tag>Custom</tag>", "Custom");
            benchmark(xmltest5, " <tag>Custom</tag> ", "Custom");
            benchmark(xmltest5, "Custom", "Custom");

            // "press any key to continue"
            Console.WriteLine("Done.");
            Console.ReadLine();
        }

        public static String xmltest1(String data)
        {
            try
            {
                return XElement.Parse(data).Value;
            }
            catch (System.Xml.XmlException)
            {
                return data;
            }
        }

        static Regex xmltest2regex = new Regex("^[ \t\r\n]*<");
        public static String xmltest2(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || xmltest2regex.Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
           return data;
        }

        static Regex xmltest3regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
        public static String xmltest3(String data)
        {
            Match m = xmltest3regex.Match(data);
            if (m.Success)
            {
                GroupCollection gc = m.Groups;
                if (gc.Count > 0)
                {
                    return gc["text"].Value;
                }
            }
            return data;
        }

        public static String xmltest4(String data)
        {
            String result;
            if (!XmlExpresssion.TryParse(data, out result))
                result = data;

            return result;
        }

        static Regex xmltest5regex = new Regex("^[ \t\r\n]*<");
        public static String xmltest5(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || data.Trim()[0] == '<' || xmltest5regex.Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
            return data;
        }
    }

    public class XmlExpresssion
    {
        // EXPLANATION OF EXPRESSION
        // <        :   \<{1}
        // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
        // >        :   >{1}
        // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data      
        // </       :   <{1}/{1}
        // text     :   \k<xmlTag>
        // >        :   >{1}
        // (\w|\W)* :   Matches attributes if any

        // Sample match and pattern egs
        // Just to show how I incrementally made the patterns so that the final pattern is well-understood
        // <text>data</text>
        // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

        //<text />
        // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

        //<text>data</text> or <text />
        // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
        // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        private static string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";
        private static Regex regex = new Regex(XML_PATTERN, RegexOptions.Compiled);

        // Checks if the string is in xml format
        private static bool IsXml(string value)
        {
            return regex.IsMatch(value);
        }

        /// <summary>
        /// Assigns the element value to result if the string is xml
        /// </summary>
        /// <returns>true if success, false otherwise</returns>
        public static bool TryParse(string s, out string result)
        {
            if (XmlExpresssion.IsXml(s))
            {
                result = regex.Match(s).Result("${data}");
                return true;
            }
            else
            {
                result = null;
                return false;
            }
        }

    }


}

E aqui estão os novos resultados:

<tag>base</tag>: 3.667
 <tag>base</tag> : 3.707
base: 40.737
<tag>ColinBurnett</tag>: 3.707
 <tag>ColinBurnett</tag> : 4.784
ColinBurnett: 0.413
<tag>Si</tag>: 2.016
 <tag>Si</tag> : 2.141
Si: 0.087
<tag>RashmiPandit</tag>: 12.305
 <tag>RashmiPandit</tag> : fail
RashmiPandit: 0.131
<tag>Custom</tag>: 3.761
 <tag>Custom</tag> : 3.866
Custom: 0.329
Done.

Lá você tem. regex pré-compilados são o caminho a percorrer, e bastante eficiente para arrancar.

(Post original)

Eu remendada o seguinte programa para aferir os exemplos de código que foram fornecidos para esta resposta, para demonstrar o raciocínio para o meu post, bem como avaliar a velocidade das respostas privded.

Sem mais delongas, aqui está o programa.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    delegate String xmltestFunc(String data);

    class Program
    {
        static readonly int iterations = 1000000;

        private static void benchmark(xmltestFunc func, String data, String expectedResult)
        {
            if (!func(data).Equals(expectedResult))
            {
                Console.WriteLine(data + ": fail");
                return;
            }
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i)
                func(data);
            sw.Stop();
            Console.WriteLine(data + ": " + (float)((float)sw.ElapsedMilliseconds / 1000));
        }

        static void Main(string[] args)
        {
            benchmark(xmltest1, "<tag>base</tag>", "base");
            benchmark(xmltest1, " <tag>base</tag> ", "base");
            benchmark(xmltest1, "base", "base");
            benchmark(xmltest2, "<tag>ColinBurnett</tag>", "ColinBurnett");
            benchmark(xmltest2, " <tag>ColinBurnett</tag> ", "ColinBurnett");
            benchmark(xmltest2, "ColinBurnett", "ColinBurnett");
            benchmark(xmltest3, "<tag>Si</tag>", "Si");
            benchmark(xmltest3, " <tag>Si</tag> ", "Si" );
            benchmark(xmltest3, "Si", "Si");
            benchmark(xmltest4, "<tag>RashmiPandit</tag>", "RashmiPandit");
            benchmark(xmltest4, " <tag>RashmiPandit</tag> ", "RashmiPandit");
            benchmark(xmltest4, "RashmiPandit", "RashmiPandit");

            // "press any key to continue"
            Console.WriteLine("Done.");
            Console.ReadLine();
        }

        public static String xmltest1(String data)
        {
            try
            {
                return XElement.Parse(data).Value;
            }
            catch (System.Xml.XmlException)
            {
                return data;
            }
        }

        public static String xmltest2(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || new Regex("^[ \t\r\n]*<").Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
           return data;
        }

        public static String xmltest3(String data)
        {
            Regex regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
            Match m = regex.Match(data);
            if (m.Success)
            {
                GroupCollection gc = m.Groups;
                if (gc.Count > 0)
                {
                    return gc["text"].Value;
                }
            }
            return data;
        }

        public static String xmltest4(String data)
        {
            String result;
            if (!XmlExpresssion.TryParse(data, out result))
                result = data;

            return result;
        }

    }

    public class XmlExpresssion
    {
        // EXPLANATION OF EXPRESSION
        // <        :   \<{1}
        // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
        // >        :   >{1}
        // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data      
        // </       :   <{1}/{1}
        // text     :   \k<xmlTag>
        // >        :   >{1}
        // (\w|\W)* :   Matches attributes if any

        // Sample match and pattern egs
        // Just to show how I incrementally made the patterns so that the final pattern is well-understood
        // <text>data</text>
        // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

        //<text />
        // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

        //<text>data</text> or <text />
        // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
        // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        private const string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        // Checks if the string is in xml format
        private static bool IsXml(string value)
        {
            return Regex.IsMatch(value, XML_PATTERN);
        }

        /// <summary>
        /// Assigns the element value to result if the string is xml
        /// </summary>
        /// <returns>true if success, false otherwise</returns>
        public static bool TryParse(string s, out string result)
        {
            if (XmlExpresssion.IsXml(s))
            {
                Regex r = new Regex(XML_PATTERN, RegexOptions.Compiled);
                result = r.Match(s).Result("${data}");
                return true;
            }
            else
            {
                result = null;
                return false;
            }
        }

    }


}

E aqui estão os resultados. Cada um foi executado 1 milhão de vezes.

<tag>base</tag>: 3.531
 <tag>base</tag> : 3.624
base: 41.422
<tag>ColinBurnett</tag>: 3.622
 <tag>ColinBurnett</tag> : 16.467
ColinBurnett: 7.995
<tag>Si</tag>: 19.014
 <tag>Si</tag> : 19.201
Si: 15.567

Test 4 levou muito tempo, como 30 minutos mais tarde foi considerado muito lento. Para demonstrar como muito mais lento do que era, aqui é o mesmo teste executado apenas 1000 vezes.

<tag>base</tag>: 0.004
 <tag>base</tag> : 0.004
base: 0.047
<tag>ColinBurnett</tag>: 0.003
 <tag>ColinBurnett</tag> : 0.016
ColinBurnett: 0.008
<tag>Si</tag>: 0.021
 <tag>Si</tag> : 0.017
Si: 0.014
<tag>RashmiPandit</tag>: 3.456
 <tag>RashmiPandit</tag> : fail
RashmiPandit: 0
Done.

Extrapolando para um milhão de execuções, ela já teria tomado 3456 segundos, ou pouco mais de 57 minutos.

Este é um bom exemplo de por que regex complexos são uma má idéia se você está procurando código eficiente. No entanto, mostrou que regex simples ainda pode ser boa resposta em alguns casos - ou seja, o pequeno 'pré-teste' de xml em resposta colinBurnett criado um caso base potencialmente mais caro, (regex foi criado em caso 2), mas também uma pessoa muito mais curto caso, evitando a exceção.

Eu acho que uma forma perfeitamente aceitável de lidar com a sua situação (é provavelmente a maneira que eu lidar com isso também). Eu não poderia encontrar qualquer tipo de "XElement.TryParse (string)" no MSDN, assim que a maneira que você tem que iria funcionar muito bem.

Não há nenhuma maneira de validar que o texto é XML que não fazer algo como XElement.Parse. Se, por exemplo, o último fim-de ângulo suporte está faltando do campo de texto, então não é um XML válido, e é muito improvável que você vai manchar isso com RegEx ou análise de texto. Há um número de caracteres ilegais, seqüências ilegais etc que RegEx parsiing provavelmente irá perder.

Tudo o que você pode esperar para fazer é corte curto seus casos de falha.

Então, se você esperar para ver grandes quantidades de dados não-XML eo caso menos esperado é XML, em seguida, empregando RegEx ou substring pesquisas para detectar sinais de menor pode poupar-lhe um pouco de tempo, mas eu sugiro este é única útil se você estiver em lote processando grandes quantidades de dados dentro de um loop.

Se, em vez disso, este é analisar dados inseridos usuário de um formulário web ou um winforms app então eu acho que pagar o custo da exceção pode ser melhor do que gastar o esforço dev e teste assegurando que o seu código de atalho não gerar falsos positivos / negativos.

Não está claro onde você está recebendo o seu XML a partir de (arquivo, fluxo, caixa de texto ou em outro lugar), mas lembre-se que espaços em branco, comentários, marcas de ordem de bytes e outras coisas podem ficar no caminho de regras simples, como "deve começar com um <".

Por que é regex caro? Não lhe matar 2 coelhos com uma pedra (jogo e análise)?

Um exemplo simples análise de todos os elementos, ainda mais fácil se é apenas um elemento!

Regex regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
MatchCollection matches = regex.Matches(data);
foreach (Match match in matches)
{
    GroupCollection groups = match.Groups;
    string name = groups["tag"].Value;
    string value = groups["text"].Value;
    ...
}

Como observado por @JustEngland no comentário exceções que não são caros, um depurador interceptando-los pode levar tempo, mas normalmente eles estão com bom desempenho e boas práticas. Consulte Como caro são exceções em C #? .

A melhor maneira seria para rolar sua própria função estilo TryParse:

[System.Diagnostics.DebuggerNonUserCode]
static class MyXElement
{
    public static bool TryParse(string data, out XElement result)
    {
        try
        {
            result = XElement.Parse(data);
            return true;
        }
        catch (System.Xml.XmlException)
        {
            result = default(XElement);
            return false;
        }
    }
}

O atributo DebuggerNonUserCode faz com que o depurador ignorar a exceção capturada para simplificar a sua experiência de depuração.

Usado como este:

    static void Main()
    {
        var addressList = "line one~line two~line three~postcode";

        var address = new XElement("Address");
        var addressHtml = "<span>" + addressList.Replace("~", "<br />") + "</span>";

        XElement content;
        if (MyXElement.TryParse(addressHtml, out content))
            address.ReplaceAll(content);
        else
            address.SetValue(addressHtml);

        Console.WriteLine(address.ToString());
        Console.ReadKey();
    }
}

Eu teria preferido para criar um método de extensão para o TryParse, mas você não pode criar um estático chamado em um tipo em vez de uma instância.

Clue - tudo xml válido deve começar com "<?xml "

Você pode ter que lidar com as diferenças de conjuntos de caracteres, mas verificando ASCII, UTF-8 e unicode irá cobrir 99,5% do xml lá fora.

A forma como você sugere vai ser caro se você vai usá-lo em um loop onde a maioria dos s XML não são valied, Em caso de valied xml seu código vai funcionar como não há tratamento de exceção ... por isso, se na maioria dos casos o xml é valied ou você não usá-lo em loop, o código irá funcionar bem

Se você quiser saber se é válido, por que não usar o built-in .NetFX objeto em vez de escrever um do zero?

Espero que isso ajude,

Bill

Uma variação da técnica de Colin Burnett: você pode fazer um simples regex no início para ver se o texto começa com uma tag, em seguida, tentar analisá-lo. Provavelmente> 99% das cordas você vai lidar com isso começo com um elemento válido são XML. Dessa forma, você pode pular o processamento regex para XML válido full-blown e também pular o processamento baseado em exceção em quase todos os casos.

Algo como ^<[^>]+> provavelmente faria o truque.

Eu não sou exatamente certo se a sua exigência considera o formato de arquivo e como esta pergunta foi feita há muito tempo voltar & i acontecer a procurar uma coisa semelhante, eu gostaria que você para saber o que funcionou para mim, por isso, se qualquer um vem aqui isso pode ajudar :)

Podemos usar Path.GetExtension (filePath) e verificar se ele é XML, em seguida, usá-lo outro sábio fazer o que sempre é necessária

Como sobre isso, tome a sua string ou objeto e atirar em em um novo XDocument ou XElement. resolve tudo usando ToString ().

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow