How to extract values from HTML using RegEx?

https://stackoverflow.com/questions/5327503

26-10-2019

Question

Given the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:   <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

I would like to get the values inside the <span> elements. I would also like to get the value of the class attribute on the <span> elements.

Ideally I could just run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing defined above).

The code above is a snippet from a larger HTML source file, which fails to parse with an XML parser. So I am looking for a regular expression to help extract the information of interest.

Solution

Use this (free) tool: http://www.radsoftware.com.au/regexdesigner/

Use this regex:

"<span[^>]*>(.*?)</span>"

The value of group 1 (for each match) will be the text that you need.

In C# it will look something like this:

            Regex regex = new Regex("<span[^>]*>(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string val = m.Groups[1].Value;
                    //Do something with the value
                }
            }

Amended in response to the comment, to capture the class as well:

            Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    // "class" is a reserved word in C#, so use a different variable name
                    string className = m.Groups[1].Value;
                    string val = m.Groups[2].Value;
                    //Do something with the class and value
                }
            }
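The question asks for a function that returns a dictionary of extracted entities. A minimal sketch of that idea follows, reusing the class-capturing regex above. The names SpanExtractor and Extract are hypothetical, and grouping the values by class name is an assumption about what the dictionary should look like.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SpanExtractor
{
    // Same pattern as above: group 1 is the class attribute, group 2 the inner text.
    private static readonly Regex SpanPattern =
        new Regex("<span class=\"(.*?)\">(.*?)</span>");

    // Hypothetical helper: collects the text of every matched span under its class name.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string className = m.Groups[1].Value;
            string value = m.Groups[2].Value;

            // Create the list the first time a class name is seen.
            if (!result.TryGetValue(className, out List<string> values))
            {
                values = new List<string>();
                result[className] = values;
            }
            values.Add(value);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<span class=\"xn-money\">$20 million</span> on " +
                      "<span class=\"xn-chron\">March 16, 2011</span>";
        foreach (var entry in Extract(html))
        {
            Console.WriteLine(entry.Key + ": " + string.Join(" | ", entry.Value));
        }
    }
}
```

Like the regex it is built on, this only sees spans whose opening tag is exactly <span class="...">, and it does not handle nested spans.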

Other suggestions

Assuming you don't have to deal with nested span tags, the following should work:

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

I have only done some basic testing on it, but it will match the class of the span tag (if there is one) along with the content up to the point where the tag is closed.

I strongly recommend using a real HTML or XML parser for this instead. You cannot reliably parse HTML or XML with regular expressions; the most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex will be. If you have a large HTML file to parse, it is highly likely to break any simple regex pattern.

A regex like <span[^>]*>(.*?)</span> will work on your example, but there is a lot of valid XML that is difficult or even impossible to parse with regex (for example, nested spans such as <span>foo <span>bar</span></span> will break the pattern above). If you want something that will work on other HTML samples, regex is not the way to go here.
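To see how the simple pattern behaves on nested markup, here is a small sketch. The class name NestedSpanDemo, the helper FirstSpanText, and the input string are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedSpanDemo
{
    // Returns group 1 of the first match, or an empty string when nothing matches.
    public static string FirstSpanText(string html)
    {
        Match m = Regex.Match(html, "<span[^>]*>(.*?)</span>");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // The lazy (.*?) stops at the first </span>, which closes the INNER span,
        // so the captured text is cut short and still contains a stray opening tag.
        Console.WriteLine(FirstSpanText("<span>foo <span>bar</span></span>"));
        // prints: foo <span>bar
    }
}
```

The outer span's real content is never captured as a whole, which is the kind of breakage a proper parser avoids.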

Since your HTML is not valid XML, consider the HTML Agility Pack, which I have heard is very good.
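As a rough idea of what the parser route could look like, here is a sketch using the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is installed; the class AgilityPackExtractor and its Extract method are hypothetical names, and selecting every span that has a class attribute is an assumption about what should be extracted.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the "HtmlAgilityPack" NuGet package is referenced

public static class AgilityPackExtractor
{
    // Hypothetical helper: maps each span's class attribute to the text of those spans.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//span[@class]");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            string className = node.GetAttributeValue("class", "");
            if (!result.TryGetValue(className, out List<string> values))
            {
                values = new List<string>();
                result[className] = values;
            }
            // InnerText is the text of the node with any nested tags stripped out.
            values.Add(node.InnerText);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<p><span class=\"xn-money\">$20 million</span> of " +
                      "<span class=\"xn-money\">$39 million</span></p>";
        foreach (var entry in Extract(html))
        {
            Console.WriteLine(entry.Key + ": " + string.Join(" | ", entry.Value));
        }
    }
}
```

Unlike the regex versions, this does not depend on the exact spelling of the opening tag, so extra attributes or different quoting on the span should not matter.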

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow