Question

This is not so much 'how do I do xxx' but 'how do I do xxx optimally?' (really hoping the challenge floats Dimitre's boat...)

All of the following is complicated by the restriction of the XSL processor (msxsl - basically XSLT 1.0 with a node-set(), replaces() and matches() set of extension functions).

I am generating some metadata from certain elements in a book - let's say chapters and div[title] elements (to simplify our data model quite a bit).

Page numbers in the book are given by processing instructions in mixed text nodes that might look like this:

<?Page pageId="256"?>

The page number that my element needs to be associated with will be either the first descendant processing-instruction('Page') (when the page break is essentially the first piece of content within, say, a chapter, i.e. the chapter starts on a new page), or else the first preceding::processing-instruction('Page').

Let's make up a sample document:

<?xml version="1.0" encoding="UTF-8"?>
<book>
    <chapter>
        <title><?Page pageId="1"?>Chapter I</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>Second paragraph <?Page pageId="2"?>of introduction</p>
        </div>
        <div>
            <title>Section I</title>
            <p>A paragraph</p>
            <p>Another paragraph<?Page pageID="3"?></p>
        </div>
    </chapter>
    <chapter>
        <title><?Page pageId="4"?>Chapter II</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>...</p>
        </div>
    </chapter>
</book>

(Note that although each chapter here starts on a new page, we can't guarantee that as a rule. There's a blank page at the end of chapter 1, something we see commonly.)

I want to get out some information like this (I am fine with XSLT basics; what we're interested in here is choosing the page numbers):

<meta>
    <meta>
        <field type="title">Chapter I</field>
        <field type="page">1</field>
        <meta>
            <field type="title">Section I</field>
            <field type="page">2</field>
        </meta>
    </meta>
    <meta>
        <field type="title">Chapter II</field>
        <field type="page">4</field>
    </meta>
</meta>

I can do various things using xsl:when statements and the descendant axis to decide which page number is appropriate, but I would much prefer to do something clever matching on processing instructions, as using the descendant axis on large books is currently making things far too slow to be usable. Keys would be nice, but things are further complicated by not being able to use variables or other keys in the @use or @match attributes (nor sequence constructors).
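For reference, the slow approach described above might look roughly like the following sketch. It is not the real stylesheet: the template name `page-number` and the variable `first` are made up for illustration, and the substring extraction assumes the PI content always has the form pageId="N". It is meant to be called with xsl:call-template while the context node is a chapter or div[title].

```xml
<!-- Sketch only (XSLT 1.0): pick the page for the current chapter or div[title] -->
<xsl:template name="page-number">
  <!-- First significant node inside the element:
       either non-blank text or a Page processing instruction -->
  <xsl:variable name="first"
       select="(descendant::text()[normalize-space()]
               | descendant::processing-instruction('Page'))[1]"/>
  <xsl:choose>
    <!-- The element starts with a page break: use that PI -->
    <xsl:when test="$first/self::processing-instruction('Page')">
      <xsl:value-of select=
       "substring-before(substring-after($first, '&quot;'), '&quot;')"/>
    </xsl:when>
    <!-- Otherwise fall back to the nearest preceding page break -->
    <xsl:otherwise>
      <xsl:value-of select=
       "substring-before(
          substring-after(preceding::processing-instruction('Page')[1], '&quot;'),
          '&quot;')"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```

The two descendant:: scans run once per element of interest, which is presumably where the time goes on large books.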

Currently the elements I'm interested in finding page numbers for are defined in a key (real world data is much more complex) like the following:

<xsl:key name="auth" match="chapter|div[title]" use="generate-id()"/>

Any suggestions or pointers gratefully received!


Solution

Here is a solution using keys, which may be efficient:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:key name="kPage"
   match="chapter/title/processing-instruction('Page')"
   use="generate-id(..)"/>

 <xsl:key name="kPage"
   match="processing-instruction('Page')"
   use="generate-id(following::div[title][1]/title)"/>

 <xsl:template match="*">
  <xsl:apply-templates select=
   "*[1]|following-sibling::*[1]"/>
 </xsl:template>

 <xsl:template match="chapter/title[1] | div/title[1]">
  <meta>
    <field type="title"><xsl:value-of select="."/></field>
    <field type="page">
      <xsl:variable name="vPiText"
           select="key('kPage', generate-id())[last()]"/>
      <xsl:value-of select=
      "translate($vPiText,
                 translate($vPiText, '0123456789', ''),
                 ''
                 )"/>
    </field>

    <xsl:apply-templates select="*[1]|following-sibling::*[1]"/>
  </meta>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied on the provided XML document:

<book>
    <chapter>
        <title>
            <?Page pageId="1"?>Chapter I</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>Second paragraph 
                <?Page pageId="2"?>of introduction</p>
        </div>
        <div>
            <title>Section I</title>
            <p>A paragraph</p>
            <p>Another paragraph
                <?Page pageID="3"?></p>
        </div>
    </chapter>
    <chapter>
        <title>
            <?Page pageId="4"?>Chapter II</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>...</p>
        </div>
    </chapter>
</book>

the wanted, correct result is produced:

<meta>
   <field type="title">Chapter I</field>
   <field type="page">1</field>
   <meta>
      <field type="title">Section I</field>
      <field type="page">2</field>
   </meta>
</meta>
<meta>
   <field type="title">Chapter II</field>
   <field type="page">4</field>
</meta>
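
The nested translate() call in the stylesheet is a common XSLT 1.0 idiom for keeping only certain characters of a string. A minimal standalone sketch of just that step, where the variables `s` and `junk` are illustrative names and `s` is assumed to mimic the string value of a Page PI:

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:template match="/">
  <!-- s stands in for the string value of <?Page pageId="256"?> -->
  <xsl:variable name="s" select="'pageId=&quot;256&quot;'"/>
  <!-- Inner step: delete the digits, leaving only the unwanted characters -->
  <xsl:variable name="junk" select="translate($s, '0123456789', '')"/>
  <!-- Outer step: delete those unwanted characters, leaving only the digits -->
  <xsl:value-of select="translate($s, $junk, '')"/>
 </xsl:template>
</xsl:stylesheet>
```

Applied to any well-formed input document, this should output 256. The same idiom works regardless of how the pseudo-attribute name is spelled (pageId or pageID), since everything that is not a digit is dropped.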
Licensed under: CC-BY-SA with attribution