Streaming XPath evaluation

https://stackoverflow.com/questions/996103

13-09-2019

Question

Are there any production-ready libraries for streaming evaluation of XPath expressions against a given XML document? From what I have found, most existing solutions load the entire DOM tree into memory before evaluating the XPath expression.

Solution

Would this be practical for a complete XPath implementation, given that XPath syntax allows for:

/AAA/XXX/following::*

and

/AAA/BBB/following-sibling::*

which imply look-ahead requirements? That is, from a particular node you are going to have to load the rest of the document anyway.

The documentation for the Nux library (specifically StreamingPathFilter) makes this point, and references some implementations that rely on a subset of XPath. Nux claims some streaming query capability, but given the above it will be limited in how much of XPath it can implement.
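To make the look-ahead point concrete, here is a minimal C# sketch (not taken from the answer, and unrelated to Nux) that evaluates one fixed pattern, /parent/context/following-sibling::*, in a single forward pass with System.Xml.XmlReader. The helper name FollowingSiblings and its parameters are made up for illustration, and it assumes the parent is the document's root element. It never builds a tree, but it cannot know it has reported the last match until the reader has consumed the rest of the parent element.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class FollowingSiblingSketch
{
    // Hypothetical helper: returns the names of the elements selected by
    // /<parent>/<context>/following-sibling::* in document order.
    public static List<string> FollowingSiblings(TextReader input, string parent, string context)
    {
        var result = new List<string>();
        bool seenContext = false; // true once the first context element has started

        using (var xr = XmlReader.Create(input))
        {
            while (xr.Read())
            {
                if (xr.NodeType != XmlNodeType.Element) continue;

                // The root element is at depth 0; stop if it is not the expected parent.
                if (xr.Depth == 0 && xr.Name != parent) break;

                // Only direct children of the root can be siblings of the context node.
                if (xr.Depth != 1) continue;

                // Test before updating the flag, so that the first context
                // element is not reported as its own sibling.
                if (seenContext) result.Add(xr.Name);
                if (xr.Name == context) seenContext = true;
            }
        }
        return result;
    }
}
```

The only state is one boolean, so memory use stays flat, but the loop has to run to the end of the parent before the result is complete. Reverse axes such as preceding-sibling:: would additionally need the earlier nodes to be buffered.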

OTHER TIPS

XSLT 3.0 provides a streaming mode of processing, which will become standard once the XSLT 3.0 specification becomes a W3C Recommendation.

At the time of writing this answer (May 2011), Saxon provides some support for XSLT 3.0 streaming.

There are several options:

DataDirect Technologies sells an XQuery implementation that employs projection and streaming where possible. It can handle files in the multi-gigabyte range, i.e. larger than available memory. It's a thread-safe library, so it's easy to integrate. Java-only.
Saxon has an open-source version and a modestly priced commercial cousin, which will do streaming in some contexts. Java, but with a .NET port also.
MarkLogic and eXist are XML databases that, if your XML is loaded into them, will process XPaths in a fairly intelligent fashion.

Try Joost.

Though I have no practical experience with it, I thought it worth mentioning QuiXProc (http://code.google.com/p/quixproc/). It is a streaming approach to XProc, and uses libraries that provide streaming support for XPath, amongst others.

FWIW, I've used Nux streaming filter XPath queries against very large (>3GB) files, and it has both worked flawlessly and used very little memory. My use case has been slightly different (not validation-centric), but I'd highly encourage you to give Nux a shot.

I think I'll go for custom code. The .NET library gets us quite close to the target, if one just wants to read some paths of the XML document.

Since all the solutions I have seen so far support only a subset of XPath, this one does too. The subset is really small though. :)

This C# code reads an XML file and counts the nodes on a given explicit path. You can also read attributes easily, using the xr["attrName"] syntax.

  // asArgs is assumed to be the program's argument array;
  // asArgs[1] is the path of the XML file.
  int c = 0;
  var lstPath = new System.Collections.Generic.List<string>();
  using (var r = new System.IO.StreamReader(asArgs[1]))
  using (var xr = System.Xml.XmlReader.Create(r, new System.Xml.XmlReaderSettings())) {
    while (xr.Read()) {
      //Console.WriteLine("type " + xr.NodeType);
      if (xr.NodeType == System.Xml.XmlNodeType.Element) {
        // lstPath holds the names of the currently open elements.
        lstPath.Add(xr.Name);
      }

      // An element is finished at its end tag, or right away if it is
      // self-closing (<a/>).
      if (xr.NodeType == System.Xml.XmlNodeType.EndElement
          || xr.IsEmptyElement) {
        // Counts every someElement anywhere below the root element main.
        if (xr.Name == "someElement" && lstPath[0] == "main")
          c++;
        // Or test a simple XPath explicitly. Build the path string only here:
        // with parsing the file as 1 time unit, rebuilding it on every node
        // took about 1.0 more, and the ToString() call about 0.6.
        // if ("/" + string.Join("/", lstPath) == "/main/someElement")
        lstPath.RemoveAt(lstPath.Count - 1);
      }
    }
  }
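
As a sketch of the attribute access mentioned above, the same stack-of-names idea can be wrapped in a small helper that returns an attribute's value for every element on an explicit child-axis path. The names PathSketch, AttributeValues and PathEquals are made up for illustration; only XmlReader and its string indexer come from .NET. Predicates, wildcards and namespaces are deliberately not handled.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class PathSketch
{
    // Hypothetical helper: for every element whose path from the root equals
    // steps (e.g. {"main", "someElement"}), collects the value of attrName.
    public static List<string> AttributeValues(TextReader input, string[] steps, string attrName)
    {
        var values = new List<string>();
        var path = new List<string>(); // names of the currently open elements

        using (var xr = XmlReader.Create(input))
        {
            while (xr.Read())
            {
                if (xr.NodeType == XmlNodeType.Element)
                {
                    path.Add(xr.Name);
                    if (PathEquals(path, steps))
                    {
                        // The indexer returns null when the attribute is
                        // absent and does not move the reader.
                        string value = xr[attrName];
                        if (value != null) values.Add(value);
                    }
                    // A self-closing element produces no EndElement node.
                    if (xr.IsEmptyElement) path.RemoveAt(path.Count - 1);
                }
                else if (xr.NodeType == XmlNodeType.EndElement)
                {
                    path.RemoveAt(path.Count - 1);
                }
            }
        }
        return values;
    }

    // Compares the open-element stack with the requested steps, name by name.
    static bool PathEquals(List<string> path, string[] steps)
    {
        if (path.Count != steps.Length) return false;
        for (int i = 0; i < steps.Length; i++)
            if (path[i] != steps[i]) return false;
        return true;
    }
}
```

Comparing the stack directly avoids rebuilding a path string on every node, and memory use grows only with the nesting depth of the document.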

Licensed under: CC-BY-SA with attribution
