Question

In Python, if I had to load an XHTML document containing an undefined entity, I would create a parser and add the entity (e.g. nbsp) to its entity dict:

import xml.etree.ElementTree as ET
parser = ET.XMLParser()
parser.entity['nbsp'] = ' '
tree = ET.parse(opener.open(url), parser=parser)

With VB.NET I tried to load the XHTML document as a LINQ to XML XDocument:

Dim x As XDocument = XDocument.Load(url)

which raised an XmlException:

Reference to undeclared entity 'nbsp'

Googling around, I couldn't find any example of how to update the entity table, or any other simple way to parse an XHTML document that contains an undefined entity.

How can I solve this apparently simple problem?

Solution

Entity resolution is done by the underlying parser, which here is a standard XmlReader (or XmlTextReader).

Officially, you're supposed to declare entities in DTDs (see Oleg's answer here: Problem with XHTML entities), or load DTDs dynamically into your documents. There are some examples here on SO like this: How do I resolve entities when loading into an XDocument?
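As a rough illustration of the DTD route, the sketch below declares nbsp in an internal DTD subset so that the parser expands the entity by itself. It is written in Python, the language the question starts from, and the inline sample document is an assumption made up for the example. The same DOCTYPE text should in principle work for a .NET reader once DTD processing is enabled (for instance through XmlReaderSettings.DtdProcessing), but that part is not shown or tested here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The internal subset between [ and ]
# declares nbsp as U+00A0, so the parser needs no extra entity table.
XHTML = """<!DOCTYPE html [
  <!ENTITY nbsp "&#160;">
]>
<html>
  <body><p>Hello&nbsp;world</p></body>
</html>"""

# The underlying expat parser expands entities declared in the internal subset.
root = ET.fromstring(XHTML)
paragraph = root.find("body/p")
print(repr(paragraph.text))
```

The trade-off is that the document itself has to carry (or be given) the declaration, which is why the other approaches here change the reader instead.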

Alternatively, you can create a hacky class derived from XmlTextReader that returns Text nodes whenever an entity reference is encountered, looking up the replacement in a dictionary, as in the following sample code:

using (XmlTextReaderWithEntities reader = new XmlTextReaderWithEntities(MyXmlFile))
{
    reader.AddEntity("nbsp", "\u00A0");
    XDocument xdoc = XDocument.Load(reader);
}

...

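// Namespaces needed by the class below (the usage snippet above also needs System.Xml.Linq)
using System.Collections.Generic;
using System.Xml;

public class XmlTextReaderWithEntities : XmlTextReader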
{
    private string _nextEntity;
    private Dictionary<string, string> _entities = new Dictionary<string, string>();

    // NOTE: override other constructors for completeness
    public XmlTextReaderWithEntities(string path)
        : base(path)
    {
    }

    public void AddEntity(string entity, string value)
    {
        _entities[entity] = value;
    }

    public override bool Read()
    {
        if (_nextEntity != null)
            return true;

        return base.Read();
    }

    public override XmlNodeType NodeType
    {
        get
        {
            if (_nextEntity != null)
                return XmlNodeType.Text;

            return base.NodeType;
        }
    }

    public override string Value
    {
        get
        {
            if (_nextEntity != null)
            {
                string value = _nextEntity;
                _nextEntity = null;
                return value;
            }
            return base.Value;
        }
    }

    public override void ResolveEntity()
    {
        // if not found, return the string as is
        if (!_entities.TryGetValue(LocalName, out _nextEntity))
        {
            _nextEntity = "&" + LocalName + ";";
        }
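        // NOTE: base.ResolveEntity() is deliberately not called here; whether you
        // need it depends on the scenario (e.g. documents that also declare entities in a DTD)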
    }
}

This approach works in simple scenarios, but you may need to override other members (such as the remaining constructors) for completeness.

PS: sorry it's in C#, you'll have to adapt to VB.NET :)

Other answers

I haven't tried this, but you could create an XmlParserContext object with the required entity declarations as its internalSubset, pass that context to the XmlTextReader constructor, and create the XDocument by loading from the reader. MSDN already has a simple example snippet in VB that uses a predefined entity this way.

In this case I suppose you are talking about a page on the web, so you could use the HTML Agility Pack, which may meet your needs.

I use it with XPath, element lookups and more. It is very useful for searching within an HTML page.

You can find the documentation here: htmlagilitypack

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow