How do I extract values from HTML using regex?

https://stackoverflow.com/questions/5327503

26-10-2019

Question

Given the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:   <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

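I want to get the values inside the <span> elements. I also want to get the value of the class attribute on each <span> element.

Ideally, I could just run some HTML through a function and get back a dictionary of the extracted entities (based on the <span> parsing described above).

The code above is a snippet from a larger source HTML file, which failed to parse with an XML parser. So I am looking for a regular expression that could help extract the information of interest.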

Solution

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this regex:

"<span[^>]*>(.*?)</span>"

The value in group 1 (of each match) will be the text you need.

In C# it would look like:

            Regex regex = new Regex("<span[^>]*>(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string val = m.Groups[1].Value;
                    //Do something with the value
                }
            }

Edit, in response to a comment (this version also captures the class):

            Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    // "class" is a reserved word in C#, so use another name
                    string cssClass = m.Groups[1].Value;
                    string val = m.Groups[2].Value;
                    //Do something with the class and value
                }
            }
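
To get the dictionary the question asks for, the same loop can be wrapped in a function. This is only a minimal sketch, not a definitive implementation: the names SpanExtractor and Extract are made up for illustration, the pattern is a slightly loosened variant of the one above (it accepts single or double quotes and a class attribute in any position), and it still assumes the spans are not nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SpanExtractor
{
    // Assumptions: spans are not nested and the class attribute is quoted.
    private static readonly Regex SpanPattern = new Regex(
        "<span[^>]*\\bclass\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Maps each class name to every text value found under that class.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string cssClass = m.Groups[1].Value;
            string val = m.Groups[2].Value.Trim();
            List<string> values;
            if (!result.TryGetValue(cssClass, out values))
            {
                values = new List<string>();
                result[cssClass] = values;
            }
            values.Add(val);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened, made-up sample in the style of the HTML in the question.
        string html = "<p><span class=\"xn-location\">OAK RIDGE, N.J.</span>, " +
                      "<span class=\"xn-money\">$20 million</span> of " +
                      "<span class=\"xn-money\">$39 million</span></p>";
        foreach (KeyValuePair<string, List<string>> pair in Extract(html))
        {
            Console.WriteLine(pair.Key + ": " + string.Join(" | ", pair.Value));
        }
    }
}
```

A list per class is used because the same class (xn-money, for example) appears several times in the sample HTML.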

Other tips

Assuming you don't have nested span tags, the following should work:

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

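I have only done some basic testing on it, but it will match the class of the span tag (if one is present) along with the data up to the closing tag.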
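
As a rough illustration of the "if present" part, here is the same pattern in C#. The class name OptionalClassDemo, the method Describe and the input string are made up for this sketch; the point is only that group 1 does not participate in the match when a span has no class attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OptionalClassDemo
{
    // Same pattern as above, written as a C# string literal.
    private static readonly Regex Pattern = new Regex(
        "<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)</span>");

    // Returns one "class -> text" line per span found.
    public static List<string> Describe(string html)
    {
        var lines = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Groups[1].Success is false when the optional class part was skipped.
            string cssClass = m.Groups[1].Success ? m.Groups[1].Value : "(none)";
            lines.Add(cssClass + " -> " + m.Groups[2].Value);
        }
        return lines;
    }

    public static void Main()
    {
        string html = "<span class=\"xn-money\">$8.88</span> and <span>plain</span>";
        foreach (string line in Describe(html))
        {
            Console.WriteLine(line);
        }
        // Prints:
        // xn-money -> $8.88
        // (none) -> plain
    }
}
```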

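I strongly recommend that you use a real HTML or XML parser for this. You cannot reliably parse HTML or XML with regular expressions. The most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex becomes. If you have a large HTML file to parse, it is very likely to break any simple regex pattern.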

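The regex <span[^>]*>(.*?)</span> will work on your example, but there is plenty of XML-valid code that is difficult or impossible to parse with regex (for example, <span>foo <span>bar</span></span> will break the above pattern). If you want something that will also work on other HTML samples, regex is not the way to go here.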
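
To see concretely how nesting trips up the simple pattern, the sketch below runs <span[^>]*>(.*?)</span> on a nested span. The class name NestedSpanDemo, the method FirstSpanText and the input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedSpanDemo
{
    private static readonly Regex Pattern = new Regex("<span[^>]*>(.*?)</span>");

    // Returns the text captured for the first span, or null if nothing matches.
    public static string FirstSpanText(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // The lazy (.*?) stops at the first closing tag, which belongs to the
        // inner span, so the captured text contains a stray opening tag.
        Console.WriteLine(FirstSpanText("<span>foo <span>bar</span></span>"));
        // Prints: foo <span>bar
    }
}
```

The outer span is cut short and the inner span is never reported on its own, which is why the pattern is only safe when spans are known not to nest.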

Since your HTML code is not XML-valid, consider the HTML Agility Pack, which I have heard good things about.
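
For comparison, here is a rough sketch of the same extraction with the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is installed (it is not part of .NET), and the names AgilityPackSketch and Extract are made up for illustration. Treat it as a starting point rather than a tested solution for the full source file.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be referenced

public static class AgilityPackSketch
{
    // Maps each class name to the text of every <span> carrying that class.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//span[@class]");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            string cssClass = node.GetAttributeValue("class", "");
            // InnerText drops nested tags; DeEntitize decodes entities such as &#160;.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            List<string> values;
            if (!result.TryGetValue(cssClass, out values))
            {
                values = new List<string>();
                result[cssClass] = values;
            }
            values.Add(text);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<p><span class=\"xn-chron\">March 16, 2011</span> " +
                      "<span class=\"xn-money\">$20 million</span></p>";
        foreach (KeyValuePair<string, List<string>> pair in Extract(html))
        {
            Console.WriteLine(pair.Key + ": " + string.Join(" | ", pair.Value));
        }
    }
}
```

Unlike the regex versions, a parser handles nested spans and attribute order without any changes to the query.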

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow