Question

I am trying to get the href link out of the following HTML code using mshtml in C# (WPF).

<a class="button_link" href="https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;sig=b0dbd522380a21007d8c375iuc583f46a90365d9&amp;iid=am-130280753913638201274485430&amp;ac=1&amp;uid=1284488216&amp;nid=18+308" style="border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;">Confirm your account now</a>

I have tried the following code, but it does not find the link.

HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
string str = "https://rhystowey.com/account/confirm_email/";
int index = innerHtml.IndexOf(str);
innerHtml = innerHtml.Remove(0, index + str.Length);
int startIndex = innerHtml.IndexOf("\"");
string str3 = innerHtml.Remove(startIndex, innerHtml.Length - startIndex);
string thelink = "https://rhystowey.com/account/confirm_email/" + str3;

Can someone please help me get this to work?


Solution

Use this:

var ex = new Regex("href=\"(.*)\" style");
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;sig=b0dbd522380a21007d8c375iuc583f46a90365d9&amp;iid=am-130280753913638201274485430&amp;ac=1&amp;uid=1284488216&amp;nid=18+308\" style=\"border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;\">Confirm your account now</a>";

var address = ex.Match(tag).Groups[1].ToString();

But you should extend it with a check on Match.Success. If the pattern does not match, Groups[1] does not throw; it returns an empty group, so address would silently be an empty string.
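A minimal sketch of such a check. TryGetHref is a hypothetical helper name, not part of any library, and the pattern here uses [^"]+ so it stops at the closing quote rather than depending on a style attribute following the href.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefCheck
{
    // Hypothetical helper: returns true and sets href only when the pattern matched.
    public static bool TryGetHref(string tag, out string href)
    {
        // [^"]+ captures everything up to the closing quote of the attribute.
        Match match = Regex.Match(tag, "href=\"([^\"]+)\"");

        // On a failed match, report failure instead of handing back an empty string.
        href = match.Success ? match.Groups[1].Value : null;
        return match.Success;
    }
}
```

A caller would then branch on the return value, for example `if (HrefCheck.TryGetHref(tag, out address)) { ... }`, instead of using the captured value unconditionally.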

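In your example, read the markup with outerHTML instead of outerText. outerText returns only the rendered text, so the href attribute never appears in it:

HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerHTML;
var ex = new Regex("href=\"([^\"]+)\"");
var address = ex.Match(innerHtml).Groups[1].ToString();

This matches the first href="...". Alternatively, select all occurrences: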

var matches = (from Match match in ex.Matches(innerHtml) select match.Groups[1].Value).ToList();

This gives you a List<string> with all the links in your HTML. To filter it, you can either do this:

var wantedMatches = matches.Where(m => m.StartsWith("https://rhystowey.com/account/confirm_email/"));

which is more flexible because you could check against a list of start strings, for instance. Or you can do it in the regex itself, which should lead to better performance:

var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");
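The "list of start strings" variant of the Where filter could be sketched as follows. LinkFilter and StartingWithAny are hypothetical names chosen for illustration, and the ordinal comparison is an assumption about how the prefixes should be compared.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinkFilter
{
    // Keeps only the links that begin with at least one of the given prefixes.
    public static List<string> StartingWithAny(IEnumerable<string> links, IEnumerable<string> prefixes)
    {
        return links
            .Where(link => prefixes.Any(p => link.StartsWith(p, StringComparison.Ordinal)))
            .ToList();
    }
}
```

With the matches list from above, this would be called as `LinkFilter.StartingWithAny(matches, new[] { "https://rhystowey.com/account/confirm_email/" })`, adding further prefixes to the array as needed.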

Putting it all together, as far as I understand what you want:

var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");
var matches = (from Match match in ex.Matches(innerHtml)
               where match.Groups[1].Success
               select match.Groups[1].Value).ToList();
var firstAddress = matches.FirstOrDefault();

firstAddress holds your link, if there is one.
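One further detail: the markup in the question writes the query string with &amp; entities, so the captured text may still contain them and would not be a usable URL as-is. A sketch of decoding the capture with System.Net.WebUtility.HtmlDecode follows; ConfirmLink and Find are hypothetical names, and whether the string you search still contains the entities depends on how the markup was obtained.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class ConfirmLink
{
    // Same pattern as above: capture the confirmation URL up to the closing quote.
    static readonly Regex Pattern = new Regex(
        "href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");

    // Returns the first confirmation link with HTML entities decoded, or null if none is found.
    public static string Find(string html)
    {
        Match match = Pattern.Match(html);
        return match.Success ? WebUtility.HtmlDecode(match.Groups[1].Value) : null;
    }
}
```

Calling `ConfirmLink.Find(innerHtml)` would then give either a link in which each &amp; has been turned back into a plain &, or null when the page contains no such link.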

Other tips

If your link will always start with the same path and isn't repeated on the page, you can use this (untested):

    var match = Regex.Match(html, @"href=""(?<href>https://rhystowey\.com/account/confirm_email/[^""]+)""");

    if (match.Success)
    {
      var href = match.Groups["href"].Value;
      ....
    }

Licensed under: CC-BY-SA with attribution