How can I extract values from HTML using a regex?
26-10-2019
Question
Given the following HTML:
<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares.  This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion.  A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion.  The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>
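I want to get the values inside the <span> elements. I also want to get the class attribute of each <span> element.
Ideally, I could just run some HTML through a function and get back a dictionary of the extracted entities (based on the <span> elements shown above).
The code above is a snippet from a larger source HTML file, which failed to parse with an XML parser. So I'm looking for a regular expression that could help extract the information of interest.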
Solution
Use this tool (free): http://www.radsoftware.com.au/regexdesigner/
Use this regex:
"<span[^>]*>(.*?)</span>"
Group 1 of each match will hold the text you need.
In C# it would look like this:
// Requires: using System.Text.RegularExpressions;
Regex regex = new Regex("<span[^>]*>(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
    MatchCollection collection = regex.Matches(toMatch);
    foreach (Match m in collection)
    {
        // Group 1 is the text between the opening and closing tags
        string val = m.Groups[1].Value;
        //Do something with the value
    }
}
Edit, in response to the comment:
Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
    MatchCollection collection = regex.Matches(toMatch);
    foreach (Match m in collection)
    {
        // "class" is a reserved word in C#, so use a different variable name
        string className = m.Groups[1].Value;
        string val = m.Groups[2].Value;
        //Do something with the class and value
    }
}
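The question asks for a function that returns a dictionary, so here is a minimal sketch of that idea in Python rather than C#. It reuses the class-and-text pattern above. The names `SPAN_RE` and `extract_entities` are made up for illustration, and `re.DOTALL` is an added assumption so that span text containing line breaks still matches. Like the pattern it is built on, it only handles spans whose first attribute is `class` and that are not nested.

```python
import re
from collections import defaultdict

# Same pattern as above: group 1 is the class, group 2 is the inner text.
# re.DOTALL (an assumption, not in the original) lets "." cross line breaks.
SPAN_RE = re.compile(r'<span class="(.*?)">(.*?)</span>', re.DOTALL)

def extract_entities(html):
    """Return {class name: [inner text, ...]} for every matched <span>."""
    entities = defaultdict(list)
    for cls, text in SPAN_RE.findall(html):
        entities[cls].append(text.strip())
    return dict(entities)

# Shortened version of the HTML from the question.
sample = ('<p><span class="xn-location">OAK RIDGE, N.J.</span>, '
          '<span class="xn-chron">March 16, 2011</span> redeemed '
          '<span class="xn-money">$20 million</span> of '
          '<span class="xn-money">$39 million</span></p>')

print(extract_entities(sample))
```

On the shortened sample this prints `{'xn-location': ['OAK RIDGE, N.J.'], 'xn-chron': ['March 16, 2011'], 'xn-money': ['$20 million', '$39 million']}`. Values are collected into lists because the same class (for example `xn-money`) occurs several times.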
Other answers
Assuming you have no nested span tags, the following should work:
/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/
I've only done some basic testing on it, but it will match the class of the span (if there is one) along with the data, up to the closing tag.
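As a rough illustration of how that pattern behaves, here is a small Python sketch (the delimiting slashes are dropped and `span_pairs` is a made-up helper name). It shows the optional class group being filled when the attribute is present and left empty when it is not.

```python
import re

# The pattern above without the surrounding /.../ delimiters.
# Group 1 (the class) is optional; group 2 is the inner text.
OPTIONAL_CLASS_RE = re.compile(
    r'<span(?:[^>]+class="(.*?)"[^>]*)?>(.*?)</span>'
)

def span_pairs(html):
    """Return (class or None, inner text) for each non-nested <span>."""
    return [(m.group(1), m.group(2)) for m in OPTIONAL_CLASS_RE.finditer(html)]

pairs = span_pairs('<span class="xn-money">$8.88</span> and <span>plain</span>')
print(pairs)
```

This prints `[('xn-money', '$8.88'), (None, 'plain')]`: a span without a class still matches, with `None` in place of the class name.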
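I strongly recommend using a real HTML or XML parser for this instead. You cannot reliably parse HTML or XML with regular expressions: the most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex becomes. If you have a large HTML file to parse, it is very likely to break any simple regex pattern.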
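The regex <span[^>]*>(.*?)</span> will work on your example, but there is plenty of XML-valid code that is difficult or impossible to parse with a regex (for example, <span>foo <span>bar</span></span> will break the above pattern). If you want something that also works on other HTML samples, a regex is not the way to go here.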
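The nested case is easy to reproduce. The following Python sketch (variable names are illustrative) runs the simple pattern against the nested example:

```python
import re

# The simple pattern discussed above.
SIMPLE_RE = re.compile(r'<span[^>]*>(.*?)</span>')

nested = '<span>foo <span>bar</span></span>'

# The lazy (.*?) stops at the FIRST </span>, which belongs to the inner span.
matches = SIMPLE_RE.findall(nested)
print(matches)
```

This prints `['foo <span>bar']`: the outer span is cut off at the inner closing tag, and the inner span is never reported on its own.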
Since your HTML code is not XML-valid, consider the HTML Agility Pack, which I have heard is very good.
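The HTML Agility Pack is a .NET library. To show the same parser-based idea without it, here is a sketch that swaps in Python's standard-library `html.parser` module, which tolerates markup that is not valid XML. The `SpanCollector` class and the `extract_spans` function are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

class SpanCollector(HTMLParser):
    """Collect (class, text) for each <span>, including nested ones."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open spans: [class, list of text chunks]
        self.found = []   # completed spans as (class, text)

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append([dict(attrs).get("class"), []])

    def handle_data(self, data):
        # Text belongs to every span that is currently open.
        for entry in self.stack:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            cls, chunks = self.stack.pop()
            self.found.append((cls, "".join(chunks)))

def extract_spans(html):
    parser = SpanCollector()
    parser.feed(html)
    parser.close()
    return parser.found

print(extract_spans('<span class="outer">foo <span class="inner">bar</span></span>'))
```

This prints `[('inner', 'bar'), ('outer', 'foo bar')]`. Spans are reported in the order they close, so the nested case that defeats the simple regex is handled, with the outer span's text including the text of the inner one.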