Question

This error has recently started to plague my RSS feed parser. This morning four of my RSS feeds started to throw this exception:

For security reasons DTD is prohibited in this XML document. To enable DTD processing set the DtdProcessing property on XmlReaderSettings to Parse and pass the settings into XmlReader.Create method.

The code used to work fine, but I believe a change to these four specific RSS feeds is causing the issue: either the feed now uses a DTD when it did not before, or there was some schema change that SyndicationFeed is unable to parse.

So I changed my code to

string url = RssFeed.AbsoluteUri;
XmlReaderSettings st = new XmlReaderSettings();

st.DtdProcessing = DtdProcessing.Parse;
st.ValidationType = ValidationType.DTD;

XmlReader reader = XmlReader.Create(url,st);

SyndicationFeed feed = SyndicationFeed.Load(reader);

reader.Close();

Then I started to receive this error:

The 'html' element is not declared.
   in System.Xml.XmlValidatingReaderImpl.ValidationEventHandling.System.Xml.IValidationEventHandling.SendEvent(Exception exception, XmlSeverityType severity)
   at System.Xml.Schema.BaseValidator.SendValidationEvent(String code, String arg)
   at System.Xml.Schema.DtdValidator.ProcessElement()
   at System.Xml.Schema.DtdValidator.ValidateElement()
   at System.Xml.Schema.DtdValidator.Validate()
   at System.Xml.XmlValidatingReaderImpl.ProcessCoreReaderEvent()
   at System.Xml.XmlValidatingReaderImpl.Read()
   at System.Xml.XmlReader.MoveToContent()
   at System.Xml.XmlReader.IsStartElement(String localname, String ns)
   at System.ServiceModel.Syndication.Atom10FeedFormatter.CanRead(XmlReader reader)
   at System.ServiceModel.Syndication.SyndicationFeed.Load[TSyndicationFeed](XmlReader reader)
   at System.ServiceModel.Syndication.SyndicationFeed.Load(XmlReader reader)

I have no idea where this 'html' element is coming from, since neither the feed nor any visible DTD definition in the feed (http://jobs.huskyenergy.com/RSS) mentions it. I have also tried setting DtdProcessing to DtdProcessing.Ignore, but that results in the following error:

The element with name 'html' and namespace '' is not an allowed feed format.

which is even more confusing, because the namespace is blank and I'm not sure where this godforsaken html element is coming from.

I'm very close to writing my own XML reader and scrapping SyndicationFeed, but I want to make sure I exhaust all possible solutions before going down that path.

One of the rss feeds if that helps any: http://jobs.huskyenergy.com/RSS


Solution

Here is a solution that returns a new, populated SyndicationFeed object for the given RSS URL:

using System;
using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Xml.Linq;

var feedUrl = @"http://jobs.huskyenergy.com/RSS";
try
{
    using (var webClient = new WebClient())
    {
        // present the request as coming from a browser ;-)
        webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
        // fetch feed as string; the stream and reader are disposed when the block ends
        using (var content = webClient.OpenRead(feedUrl))
        using (var contentReader = new StreamReader(content))
        {
            var rssFeedAsString = contentReader.ReadToEnd();
            // parse the feed with LINQ to XML and hand a new XmlReader to SyndicationFeed
            var feed = SyndicationFeed.Load(XDocument.Parse(rssFeedAsString).CreateReader());
            // take the info from the first feed entry, if there is one
            var firstFeedItem = feed.Items.FirstOrDefault();
            if (firstFeedItem != null)
            {
                Console.WriteLine(firstFeedItem.Title.Text);
                var firstLink = firstFeedItem.Links.FirstOrDefault();
                if (firstLink != null) Console.WriteLine(firstLink.Uri.AbsoluteUri);
            }
        }
    }
}
catch (Exception exception)
{
    Console.WriteLine(exception.Message);
}

The site apparently only serves requests that come from "browsers", so the code disguises its request as one by sending a browser user-agent header. The result is:

Summer Student UEO Regulatory & Environment Strategy - (Calgary, AB)
http://jobs.huskyenergy.com/ca/alberta/student/jobid4444904-summer-student-ueo-regulatory--environment-strategy-jobs

The WebClient class also supports asynchronous operations with events and tasks, so it is no problem to make the reader non-blocking.
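
As a rough illustration of the non-blocking variant, here is a minimal sketch built on WebClient.DownloadStringTaskAsync. The class name FeedLoader and the methods ParseFeed and LoadFeedAsync are hypothetical names chosen for this example, and the sketch assumes the server keeps accepting the same user-agent header as above.

```csharp
using System;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Threading.Tasks;
using System.Xml.Linq;

// Hypothetical helper class; the names are illustrative only.
public static class FeedLoader
{
    // Parses feed XML that is already in memory (no network access).
    public static SyndicationFeed ParseFeed(string xml)
    {
        using (var reader = XDocument.Parse(xml).CreateReader())
        {
            return SyndicationFeed.Load(reader);
        }
    }

    // Downloads the feed without blocking the calling thread, then parses it.
    public static async Task<SyndicationFeed> LoadFeedAsync(string feedUrl)
    {
        using (var webClient = new WebClient())
        {
            // Assumption: the site still serves the feed to this browser-like user agent.
            webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
            string xml = await webClient.DownloadStringTaskAsync(feedUrl);
            return ParseFeed(xml);
        }
    }
}
```

A caller would then write something like var feed = await FeedLoader.LoadFeedAsync(feedUrl); inside an async method. Keeping the parsing in its own method also makes it possible to exercise it with a local XML string.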


The explanation for the html problem is as follows: the site changed something and/or no longer allows automated feed requests. The html element comes from a service interruption page that the server returns instead of the feed. I tried to access the service using LINQ to XML in LINQPad (Dump is a LINQPad extension method, so don't be surprised by it):

var feedUrl = @"http://jobs.huskyenergy.com/RSS";
var feedContent = XDocument.Load(feedUrl);
feedContent.Dump();
//var feed = SyndicationFeed.Load(feedContent.CreateReader());
//feed.Dump();

and got this answer:

<!DOCTYPE html []>
<!--[if IE 7]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9 lt-ie8"><![endif]-->
<!--[if IE 8]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9"><![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en" prefix="og: http://ogp.me/ns#" class="non-js">
  <!--<![endif]-->
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width" />
    <title>
    Service Interruption
</title>
    <link rel="stylesheet" href="http://seostatic.tmp.com/SiteOutage/style.css" />
  </head>
  <body>
    <p id="outageMessage">This system is currently experiencing a service interruption. <br />We apologize for any inconvenience.</p>
  </body>
</html>

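So the html element is revealed. :-) The site looks just fine when opened in a browser, which means XmlReader and LINQ to XML are working correctly: they parse exactly what the server sends, and the server sends this outage page to clients it does not recognize as browsers.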
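
To catch this situation earlier and with a clearer message, one option is to look at the root element before handing the content to SyndicationFeed. The sketch below is only an illustration: FeedGuard and LooksLikeFeed are hypothetical names, and the check assumes that only RSS 2.0 (root element rss) and Atom 1.0 (root element feed) are of interest, since those are the formats SyndicationFeed.Load reads.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class FeedGuard
{
    // Returns true when the content is well-formed XML whose root element
    // is "rss" (RSS 2.0) or "feed" (Atom 1.0); false for anything else,
    // for example an HTML outage page or text that is not XML at all.
    public static bool LooksLikeFeed(string content)
    {
        XDocument document;
        try
        {
            document = XDocument.Parse(content);
        }
        catch (XmlException)
        {
            return false; // not well-formed XML
        }

        string rootName = document.Root.Name.LocalName;
        return rootName == "rss" || rootName == "feed";
    }
}
```

If LooksLikeFeed returns false, the caller can log the first part of the response instead of a parser exception, which would have pointed at the outage page right away.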

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow