Question

So in this grotty extruded typesetting product, I sometimes see links and email addresses that have been split apart. Example:

<p>Here is some random text with an email address 
<Link>example</Link><Link>@example.com</Link> and here 
is more random text with a url 
<Link>http://www.</Link><Link>example.com</Link> near the end of the sentence.</p>

Desired output:

<p>Here is some random text with an email address 
<email>example@example.com</email> and here is more random text 
with a url <ext-link ext-link-type="uri" xlink:href="http://www.example.com/">
http://www.example.com/</ext-link> near the end of the sentence.</p>

Whitespace between the elements does not appear to occur, which is one blessing.

I can tell I need to use an xsl:for-each-group within the p template, but I can't quite see how to put the combined text from the group through the contains() function so as to distinguish emails from URLs. Help?

Was it helpful?

Solution

If you use group-adjacent then you can simply string-join the current-group() as in

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsd"
  version="2.0">

  <xsl:template match="p">
    <xsl:copy>
      <xsl:for-each-group select="node()" group-adjacent="boolean(self::Link)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <xsl:variable name="link-text" as="xsd:string" select="string-join(current-group(), '')"/>
            <xsl:choose>
              <xsl:when test="matches($link-text, '^https?://')">
                <ext-link ext-link-type="uri" xlink:href="{$link-text}">
                  <xsl:value-of select="$link-text"/>
                </ext-link>
              </xsl:when>
              <xsl:otherwise>
                <email><xsl:value-of select="$link-text"/></email>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

OTHER TIPS

Here is an XSLT 1.0 solution based on the identity template, with special treatment for <Link> elements.

<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*" />
  </xsl:copy>
</xsl:template>

<xsl:template match="Link">
  <xsl:if test="not(preceding-sibling::node()[1][self::Link])">
    <xsl:variable name="link">
      <xsl:copy-of select="
        text()
        | 
        following-sibling::Link[
          preceding-sibling::node()[1][self::Link]
          and
          generate-id(current())
          =
          generate-id(
            preceding-sibling::Link[
              not(preceding-sibling::node()[1][self::Link])
            ][1]
          )
        ]/text()
      " />
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="contains($link, '://')">
        <ext-link ext-link-type="uri" xlink:href="{$link}" />
      </xsl:when>
      <xsl:when test="contains($link, '@')">
        <email>
          <xsl:value-of select="$link" />
        </email>
      </xsl:when>
      <xsl:otherwise>
        <link type="unknown">
          <xsl:value-of select="$link" />
        </link>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:if>
</xsl:template>

I know that XPath expressions used are some quite a hairy monsters, but selecting adjacent siblings is not easy in XPath 1.0 (if someone has a better idea how to do it in XPath 1.0, go ahead and tell me).

not(preceding-sibling::node()[1][self::Link])

means "the immediately preceding node must not be a <Link>", e.g.: only <Link> elements that are "first in a row".

following-sibling::Link[
  preceding-sibling::node()[1][self::Link]
  and
  generate-id(current())
  =
  generate-id(
    preceding-sibling::Link[
      not(preceding-sibling::node()[1][self::Link])
    ][1]
  )
]

means

  • from all following-sibling <Link>s, choose the ones that
    • immediately follow a <Link> (e.g. they are not "first in a row"), and
    • the ID of the current() node (always a <Link> that's "first in a row") must be equal to:
    • the closest preceding <Link> that itself is "first in a row"

If that makes sense.

Applied to your input, I get:

<p>Here is some random text with an email address
<email>example@example.com</email> and here
is more random text with a url
<ext-link ext-link-type="uri" xlink:href="http://www.example.com" /> near the end of the sentence.</p>
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top