Question

When coding in Python, if I had to load an XHTML document containing an undefined entity, I would create a parser and update its entity dict (e.g. for nbsp):

import xml.etree.ElementTree as ET
parser = ET.XMLParser()
parser.entity['nbsp'] = ' '
tree = ET.parse(opener.open(url), parser=parser)
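
For reference, here is a self-contained sketch of that Python approach. The SAMPLE document is made up for illustration, and io.BytesIO stands in for the network response. As far as I can tell, the entity dict is only consulted when the document carries a DOCTYPE with an external identifier, which XHTML pages normally do; without one the parser reports the undefined entity as an error.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document; the DOCTYPE is what lets the parser defer
# the unknown &nbsp; reference to the entity dict instead of failing.
SAMPLE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><p>a&nbsp;b</p></body></html>'
)

parser = ET.XMLParser()
parser.entity['nbsp'] = '\u00a0'  # map the entity name to a no-break space

# io.BytesIO replaces opener.open(url) so the sketch runs offline
tree = ET.parse(io.BytesIO(SAMPLE), parser=parser)
paragraph = tree.getroot().find('.//{http://www.w3.org/1999/xhtml}p')
text = paragraph.text  # the paragraph text with the entity already expanded
```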

With VB.NET I tried to load the XHTML document as a LINQ XDocument:

Dim x As XDocument = XDocument.Load(url)

which raised an XmlException:

Reference to undeclared entity 'nbsp'

Googling around, I couldn't find any example of how to update the entity table, or any other simple way to parse an XHTML document that contains an undefined entity.

How can I solve this apparently simple problem?

Solution

Entity resolution is done by the underlying parser, which here is a standard XmlReader (or XmlTextReader).

Officially, you're supposed to declare entities in a DTD (see Oleg's answer here: Problem with XHTML entities), or load DTDs dynamically into your documents. There are some examples here on SO, like this one: How do I resolve entities when loading into an XDocument?

What you can also do is create a hacky class derived from XmlTextReader that returns Text nodes whenever an entity reference is encountered, looking the value up in a dictionary, as in the following sample code:

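// Requires: using System.Collections.Generic; using System.Xml; using System.Xml.Linq;
// MyXmlFile is the path or URL of the document to load
using (XmlTextReaderWithEntities reader = new XmlTextReaderWithEntities(MyXmlFile))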
{
    reader.AddEntity("nbsp", "\u00A0");
    XDocument xdoc = XDocument.Load(reader);
}

...

public class XmlTextReaderWithEntities : XmlTextReader
{
    private string _nextEntity;
    private Dictionary<string, string> _entities = new Dictionary<string, string>();

    // NOTE: override other constructors for completeness
    public XmlTextReaderWithEntities(string path)
        : base(path)
    {
    }

    public void AddEntity(string entity, string value)
    {
        _entities[entity] = value;
    }

    public override bool Read()
    {
        if (_nextEntity != null)
            return true;

        return base.Read();
    }

    public override XmlNodeType NodeType
    {
        get
        {
            if (_nextEntity != null)
                return XmlNodeType.Text;

            return base.NodeType;
        }
    }

    public override string Value
    {
        get
        {
            if (_nextEntity != null)
            {
                string value = _nextEntity;
                _nextEntity = null;
                return value;
            }
            return base.Value;
        }
    }

    public override void ResolveEntity()
    {
        // if not found, return the string as is
        if (!_entities.TryGetValue(LocalName, out _nextEntity))
        {
            _nextEntity = "&" + LocalName + ";";
        }
        // NOTE: we don't use base here. Depends on the scenario
    }
}

This approach works in simple scenarios, but you may need to override some other members for completeness.

PS: Sorry it's in C#; you'll have to adapt it to VB.NET :)

OTHER TIPS

I haven't done this, but you could create an XmlParserContext object with the required entity declarations as its internalSubset. Pass that context to the XmlTextReader constructor, then create the XDocument by loading it from the reader. MSDN already has a simple-looking VB code snippet that uses a pre-defined entity.
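
The internal-subset idea is not specific to .NET. As a rough illustration in Python (the question's other language), the sketch below prepends a DOCTYPE whose internal subset declares nbsp, after which the stock parser expands the reference by itself. declare_entities is a hypothetical helper, not a library function. It assumes the input has no XML declaration or DOCTYPE of its own and that each entity maps to a single character. It does not show the XmlParserContext calls themselves.

```python
import xml.etree.ElementTree as ET


def declare_entities(xml_text, root_name, entities):
    """Prepend a DOCTYPE whose internal subset declares the given entities.

    Hypothetical helper: assumes xml_text has no XML declaration or DOCTYPE,
    and that every entity value is a single character.
    """
    declarations = "".join(
        '<!ENTITY {} "&#{};">'.format(name, ord(char))
        for name, char in entities.items()
    )
    return "<!DOCTYPE {} [{}]>{}".format(root_name, declarations, xml_text)


# Made-up input that uses &nbsp; without declaring it
source = "<html><body><p>a&nbsp;b</p></body></html>"

# With the declaration in place, no custom parser is needed
root = ET.fromstring(declare_entities(source, "html", {"nbsp": "\u00a0"}))
text = root.find("./body/p").text
```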

In this case I suppose you're talking about a page on the web, so you could use the Html Agility Pack, which may meet your needs.

I use it with XPath, elements, and more. It is very useful for searching within an HTML page, etc.

You can find the documentation here: htmlagilitypack

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow