Question

Given the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:   <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

I'd like to get the values inside the <span> elements. I'd also like to get the value of the class attribute on the <span> elements.

Ideally I could just run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing defined above).

The above code is a snippet from a larger source HTML file, which fails to pare with an XML parser. So I'm looking for a possible regular expression to help extract the information of interest.

Was it helpful?

Solution

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this Regex:

"<span[^>]*>(.*?)</span>"

The values in Group 1 (for each match) will be the text that you need.

In C# it will look like:

            Regex regex = new Regex("<span[^>]*>(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string val = m.Groups[1].Value;
                    //Do something with the value
                }
            }

Ammended to answer comment:

            Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string class = m.Groups[1].Value;
                    string val = m.Groups[2].Value;
                    //Do something with the class and value
                }
            }

OTHER TIPS

Assuming that you have no nested span tags, the following should work:

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

I only did some basic testing on it, but it'll match the class of the span tag (if it exists) along with the data until the tag is closed.

I strongly advise you to use a real HTML or XML parser for this instead. You cannot reliably parse HTML or XML with regular expressions--the most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex will be. If you have a large HTML file to parse, it's highly likely to break any simple regex pattern.

Regex like <span[^>]*>(.*?)</span> will work on your example, but there's a LOT of XML-valid code that's difficult or even impossible to parse with regex (for example, <span>foo <span>bar</span></span> will break the above pattern). If you want something that's going to work on other HTML samples, regex isn't the way to go here.

Since your HTML code isn't XML-valid, consider the HTML Agility Pack, which I've heard is very good.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top