How to extract values from HTML using regex?

https://stackoverflow.com/questions/5327503

26-10-2019
Question

Given the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:   <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

I would like to get the values inside the <span> elements. I would also like to get the value of the class attribute on the <span> elements.

Ideally, I could run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing described above).

The HTML above is a snippet from a larger source HTML file, which cannot be parsed with an XML parser. So I am looking for a regular expression that can help extract the information of interest.

Solution

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this regex:

"<span[^>]*>(.*?)</span>"

The values in Group 1 (for each match) will be the text you need.

In C# it will look like this:

            Regex regex = new Regex("<span[^>]*>(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string val = m.Groups[1].Value;
                    //Do something with the value
                }
            }

Modified to address the comments:

            Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string cssClass = m.Groups[1].Value; // "class" is a reserved word in C#
                    string val = m.Groups[2].Value;
                    //Do something with the class and value
                }
            }
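To get closer to the dictionary the question asks for, the two-group pattern can be wrapped in a small helper. This is only a minimal sketch: `SpanExtractor` and `Extract` are hypothetical names that do not appear in the original answer, and the pattern assumes that `class` is the only attribute and is double-quoted, as in the sample HTML. Because a class such as `xn-money` occurs several times, the values are collected into a list per class rather than a single string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SpanExtractor
{
    // Group 1 is the class name, group 2 is the text inside the span.
    // Singleline lets "." cross line breaks inside a span.
    private static readonly Regex SpanPattern = new Regex(
        "<span class=\"(.*?)\">(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: maps each class name to every text found under it.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string cssClass = m.Groups[1].Value;
            string text = m.Groups[2].Value;

            if (!result.TryGetValue(cssClass, out List<string> values))
            {
                values = new List<string>();
                result[cssClass] = values;
            }
            values.Add(text);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened version of the HTML from the question.
        string html = "<p><span class=\"xn-location\">OAK RIDGE, N.J.</span>, "
                    + "<span class=\"xn-chron\">March 16, 2011</span> redeemed "
                    + "<span class=\"xn-money\">$20 million</span> of "
                    + "<span class=\"xn-money\">$39 million</span></p>";

        foreach (KeyValuePair<string, List<string>> entry in Extract(html))
        {
            Console.WriteLine(entry.Key + ": " + string.Join(" | ", entry.Value));
        }
    }
}
```

Spans with extra attributes, single-quoted classes, or nested spans are not handled by this sketch.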

Other tips

Assuming you have no nested span tags, the following should work:

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

I only did some basic testing, but it will match the class of the span tag (if it exists) along with the data up to the point where the tag is closed.
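For illustration, here is how that pattern might be used from C#. This is a sketch under assumptions: `OptionalClassDemo` and `Describe` are made-up names, the surrounding `/` delimiters are dropped because .NET does not use them, and the pattern is written as a verbatim string. When a span has no class, group 1 does not participate in the match, which `Group.Success` reports.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionalClassDemo
{
    // Same pattern as a C# verbatim string ("" is an escaped double quote).
    public static readonly Regex SpanPattern =
        new Regex(@"<span(?:[^>]+class=""(.*?)""[^>]*)?>(.*?)</span>");

    // Hypothetical helper: formats a match as "class=text",
    // using "(none)" when the span has no class attribute.
    public static string Describe(Match m)
    {
        string cssClass = m.Groups[1].Success ? m.Groups[1].Value : "(none)";
        return cssClass + "=" + m.Groups[2].Value;
    }

    public static void Main()
    {
        string html = "<span class=\"xn-money\">$8.88</span> and <span>plain</span>";
        foreach (Match m in SpanPattern.Matches(html))
        {
            Console.WriteLine(Describe(m));
        }
        // Prints:
        // xn-money=$8.88
        // (none)=plain
    }
}
```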

I strongly advise you to use a real HTML or XML parser for this. You cannot reliably parse HTML or XML with regular expressions. The most you can do is get close, and the closer you get, the more complicated and slower your regex becomes. If you have a large HTML file to parse, it is very likely to break any simple regex pattern.

A regex like <span[^>]*>(.*?)</span> will work on your example, but there is a lot of XML-valid code that is difficult or even impossible to parse with regex (for example, <span>foo <span>bar</span></span> will break the pattern above). If you want something that will work on other HTML samples, regex is not the way to go here.
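To see how nesting defeats the simple pattern, consider a made-up input with one span inside another. In this minimal sketch (`NestedSpanDemo` is a hypothetical name), the lazy group stops at the first closing tag it finds, which belongs to the inner span, so the captured text is cut off and still contains markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedSpanDemo
{
    public static readonly Regex SpanPattern = new Regex("<span[^>]*>(.*?)</span>");

    public static void Main()
    {
        string nested = "<span>foo <span>bar</span></span>";

        foreach (Match m in SpanPattern.Matches(nested))
        {
            // Only one match is found; it ends at the inner </span>.
            Console.WriteLine(m.Groups[1].Value); // prints: foo <span>bar
        }
    }
}
```

The inner span is never reported as its own match, and the outer closing tag is left over.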

Since your HTML code is not XML-valid, consider the HTML Agility Pack, which I have heard is very good.
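As a rough idea of what that could look like, the sketch below reads the class and text of every span that has a class attribute. It assumes the HtmlAgilityPack NuGet package is installed; `AgilityPackDemo` and `ReadSpans` are hypothetical names, and the input string is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class AgilityPackDemo
{
    // Hypothetical helper: returns (class, inner text) pairs for spans with a class.
    public static List<KeyValuePair<string, string>> ReadSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null when nothing matches, so guard against it.
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span[@class]");
        if (spans == null)
        {
            return result;
        }

        foreach (HtmlNode span in spans)
        {
            string cssClass = span.GetAttributeValue("class", string.Empty);
            result.Add(new KeyValuePair<string, string>(cssClass, span.InnerText));
        }
        return result;
    }

    public static void Main()
    {
        string html = "<p><span class=\"xn-money\">$20 million</span> of "
                    + "<span class=\"xn-chron\">February 6, 2009</span></p>";

        foreach (KeyValuePair<string, string> pair in ReadSpans(html))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

Unlike the regex approaches, the parser builds a node tree, so nested spans and attributes in any order do not need special handling in the query.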

Licensed under: CC-BY-SA with attribution