Syntax error in regular expression to match link URL
Question
I have the following method in some Nemerle code:
private static getLinks(text : string) : array[string] {
    def linkrx = Regex(@"<a\shref=['|\"](.*?)['|\"].*?>");
    def m = linkrx.Matches(text);
    mutable txmatches : array[string];
    for (mutable i = 0; i < m.Count; ++i) {
        txmatches[i] = m[i].Value;
    }
    txmatches
}
The problem is that the compiler, for some reason, is trying to parse the brackets inside the regex string, and that keeps the program from compiling. If I remove the @ (which I was told to put there), I get an invalid escape character error on the "\s".
Here's the compiler output:
NCrawler.n:23:21:23:22: error: when parsing this `(' brace group
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'
NCrawler.n:22:57:22:58: error: when parsing this `{' brace group
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'
NCrawler.n:8:1:8:2: error: when parsing this `{' brace group
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'
NCrawler.n:23:38:23:39: error: unexpected closing bracket `]'
(Line 23 is the line with the regex code on it.)
What should I do?
Solution
I don't know Nemerle, but it seems that using @ disables all escapes, including the escape for the ".
Try one of these:
def linkrx = Regex("<a\\shref=['\"](.*?)['\"].*?>");
def linkrx = Regex(@"<a\shref=['""](.*?)['""].*?>");
def linkrx = Regex(@"<a\shref=['\x22](.*?)['\x22].*?>");
OTHER TIPS
I'm not a Nemerle programmer, but I know that you should ALWAYS use an XML parser for XML-based data, not regexps.
I guess someone has created a DOM or XPath library for Nemerle, so you can access either //a[@href] via XPath or something like a.href.value via DOM.
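To illustrate the parser-based approach, here is a minimal sketch in Python rather than Nemerle (I'm assuming Python's standard `html.parser` module as a stand-in; the class name `LinkCollector` and the sample markup are made up for the example). It collects every href from the a tags without any regular expression:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<a class="foo" href="something">bar</a> <a href=\'other\'>x</a>')
print(collector.links)  # ['something', 'other']
```

Attribute order and quote style do not matter here, which is the main reason to prefer a parser over a pattern.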
The current regexp does not match, for example:
<a class="foo" href="something">bar</a>
I didn't test this, but something like the following should be closer:
/<a\s.+?href=['|\"]([^'\">]+)['|\"].+?>/i
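A quick way to check this kind of claim is to run both patterns against the sample tag. The sketch below uses Python's `re` module as a stand-in for the .NET regex engine (an assumption; the two agree on the constructs used here). The `loosened` pattern is my own variant, not the one above: it uses `[^>]*` instead of `.+?`, so that a plain `<a href="...">` with nothing before href or after the closing quote still matches.

```python
import re

html = '<a class="foo" href="something">bar</a>'

# Pattern from the question (pipe removed from the character classes).
original = re.compile(r"""<a\shref=['"](.*?)['"].*?>""")

# Hypothetical loosened variant: allows other attributes before and after href.
loosened = re.compile(r"""<a\s[^>]*?href=['"]([^'">]+)['"][^>]*>""", re.IGNORECASE)

print(original.findall(html))  # []
print(loosened.findall(html))  # ['something']
```

This is still only a pattern match on text, so it can be fooled by unusual markup; it just shows why the original pattern misses tags with extra attributes.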
The problem is with the quotation marks, not the brackets. In Nemerle, as in C#, you escape a quotation mark with another quotation mark, not a backslash.
@"<a\shref=['""](.*?)['""].*?>"
EDIT: Note as well that you don't need the pipe inside the square brackets; the contents are treated as a set of characters (or ranges of characters), with the OR being implied.
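The point about the pipe can be checked directly. This is a small sketch using Python's `re` module as a stand-in (an assumption; character classes behave the same way in the .NET engine); the variable names are made up for the example:

```python
import re

# Character class from the question: the pipe is a literal member of the set.
with_pipe = re.compile("['|\"]")
# Same class without the pipe: matches only the two quote characters.
without_pipe = re.compile("['\"]")

print(bool(with_pipe.fullmatch("|")))     # True: a stray | would be accepted as a quote
print(bool(without_pipe.fullmatch("|")))  # False
print(bool(without_pipe.fullmatch('"')))  # True
```

So the pipe does no harm to valid links, but it makes the pattern accept href=|...| as well.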