Question

I am writing an RSS reader and need to extract the image URL from the Google News RSS feed (https://news.google.com/?output=rss) using a regular expression. An image tag in the feed looks like this, for example:

<img src="//t0.gstatic.com/images?q=tbn:ANd9GcRfMZ3MOzznCthFKCdIan17n9B8vZvEE-tRSQVTcgJa5i1OPfdf90zi4mBuGzPfB7Bj2mwE0TE" alt="" border="1" width="80" height="80" />

By the way, this is the regex I currently use:

Regex regx = new Regex("\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))", RegexOptions.IgnoreCase);

Any advice?


Solution

First, you should not parse XML with regex. Use a real XML parser instead (XmlDocument, XmlReader, etc.).
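As a rough sketch of the parser-based route, assuming the feed is plain RSS 2.0 with un-namespaced description elements: the class and method names below (FeedSketch, GetDescriptions) are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeedSketch
{
    // Hypothetical helper: returns the text of every <description> element.
    // The XML parser un-escapes &lt; and &quot; while reading, so each
    // returned string is ordinary HTML rather than XML-encoded text.
    public static List<string> GetDescriptions(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("description")
                  .Select(d => d.Value)
                  .ToList();
    }
}
```

Because the parser already decodes the entities, anything that later inspects these strings sees a literal <img src="..."> rather than the &lt;img src=&quot;...&quot; form found in the raw feed text.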

If you know what you are doing, here is the quick and dirty regex solution.

  1. All img tags in your feed seem to be inside description tags, and they are of course XML-encoded (keep that in mind for the next few steps)
  2. Next, look at some example img tags
    1. Are you also looking for img tags without a src, or with an empty src?
    2. Overall: define what you are looking for
  3. Design your regex

Because the feed is generated automatically, the tags seem to be in the same order every time (we use that fact to keep the regex short).

Each img tag starts with < (but keep point 1 in mind: it is XML-encoded), so we look for &lt; followed by img (current regex: &lt;img)

Next comes at least one whitespace character (current regex: &lt;img\s+)

The src attribute is always the first attribute (in this case), if present, so we select src=&quot; (current regex: &lt;img\s+src=&quot;)

Next, select the URL itself with .* but be careful: the * quantifier is greedy, so we have to use the lazy quantifier .*? and finally close with &quot;

Final regex: &lt;img\s+src=&quot;(.*?)&quot; Make sure you put parentheses around the URL part so it can be read back as a capture group.
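To see why the lazy quantifier matters, here is a small sketch that runs the final regex and a greedy variant against the same encoded tag. The names ImgRegexSketch and FirstSrc are hypothetical, and the greedy pattern is there only for comparison.

```csharp
using System.Text.RegularExpressions;

public static class ImgRegexSketch
{
    // Lazy pattern from the steps above: the group ends at the first closing &quot;
    public const string Lazy = @"&lt;img\s+src=&quot;(.*?)&quot;";

    // Greedy variant, for comparison only: the group runs on to the last &quot;
    public const string Greedy = @"&lt;img\s+src=&quot;(.*)&quot;";

    // Returns capture group 1 of the first match, or an empty string if nothing matches.
    public static string FirstSrc(string encoded, string pattern)
    {
        Match m = Regex.Match(encoded, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For an encoded tag such as &lt;img src=&quot;//t0.gstatic.com/images?q=tbn:abc&quot; alt=&quot;&quot; border=&quot;1&quot; /&gt;, the lazy pattern captures only //t0.gstatic.com/images?q=tbn:abc, while the greedy one also swallows the alt and border attributes up to the last &quot; in the tag.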

Last step: C# code

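// quick & dirty :-)
// needs: using System.Diagnostics; using System.IO; using System.Net; using System.Text.RegularExpressions;
var url = "https://news.google.com/?output=rss";
var regex = @"&lt;img\s+src=&quot;(.*?)&quot;";

// download the raw feed text; the using blocks dispose the response and reader
string rssContent;
using (var response = WebRequest.Create(url).GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    rssContent = reader.ReadToEnd();
}

foreach (Match match in Regex.Matches(rssContent, regex))
{
    // print the captured image URL (group 1)
    Debug.WriteLine(match.Groups[1].Value);
}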

PS: If you are trying to write an RSS reader, you should NOT use regex to parse HTML at all! Try to find a way to transform the HTML into XAML and write your reader in WPF, or start by learning more about these problems from some open source RSS readers.

OTHER TIPS

You can use the regex pattern below:

/(.*\/images.*)/
Licensed under: CC-BY-SA with attribution