Question

I need to transform an XML document with XSLT into another XML document in which the original nodes are placed into tables, and no table may exceed a maximum number of characters, say 200. By characters I mean only the text content of the nodes; the node names do not count.

Another rule that must be met is that the text of a node cannot be split between two tables: if a table already has nodes and adding the next node would push it past the 200-character maximum, then that table must be closed and the node added to a new table.
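
To pin the rule down, here is a minimal Python sketch (not XSLT) of the greedy packing described above. The function name pack_rows and the sample lengths are made up for illustration, and it assumes that a single node longer than the limit simply gets a table to itself, a case the rules above do not cover.

```python
def pack_rows(row_lengths, limit=200):
    """Greedily group row lengths into tables whose totals stay within limit."""
    tables, current, total = [], [], 0
    for length in row_lengths:
        # Close the current table if adding this row would exceed the limit.
        # An empty table always accepts the row (assumption for oversized rows).
        if current and total + length > limit:
            tables.append(current)
            current, total = [], 0
        current.append(length)
        total += length
    if current:
        tables.append(current)
    return tables


# Hypothetical text lengths, one per node that must stay whole.
print(pack_rows([120, 60, 50, 210, 30]))  # [[120, 60], [50], [210], [30]]
```

A node is never split: the only decision is whether it joins the open table or starts a new one.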

I found a very similar case in "Serialize XML based on Character Count during an XSL transformation", but it never received a working solution. In that case the asker wants to split the input XML into several output XML files, whereas I want a single output XML file with the tables inside.

Any idea on how it could be achieved? Is it even possible? I appreciate any help and if you need some more information please let me know.

Thanks!


UPDATE 1

A sample of input XML:

<om>
<title>
    <title-en>Document title in english</title-en>
    <title-es>Título del documento en español</title-es>
</title>
<section1 id="1">
    <title>
        <title-en>Section 1 title in english</title-en>
        <title-es>Título de la sección 1 en español</title-es>
    </title>
    <p>
        <p-en>Some text 1,<br/>more text</p-en>
        <p-es>Texto 1,<br/>más texto</p-es>
    </p>
    <ul>
        <li>
            <li-en>List text 1. See section <a href="2">2</a></li-en>
            <li-es>Texto de lista 1. Ver sección <a href="2">2</a></li-es>
        </li>
        <li>
            <li-en>List text 2</li-en>
            <li-es>Texto de lista 2</li-es>
        </li>
        <li>
            <li-en>List text 3</li-en>
            <li-es>Texto de lista 3</li-es>
        </li>
    </ul>
    <p>
        <p-en>Some text 2.</p-en>
        <p-es>Texto 2.</p-es>
    </p>
    <p>
        <p-en>Some text 3.</p-en>
        <p-es>Texto 3.</p-es>
    </p>
    <section2 id="2">
        <title>
            <title-en>Section 2 title in english</title-en>
            <title-es>Título de la sección 2 en español</title-es>
        </title>
        <p>
            <p-en>Some text 4. <b>Bold text</b></p-en>
            <p-es>Texto 4. <b>Texto en negrita</b></p-es>
        </p>
        <p>
            <p-en>Some text 5.</p-en>
            <p-es>Texto 5.</p-es>
        </p>
    </section2>
</section1>
</om>

It's structured as a dual-language document, so every tag has an EN child and an ES child, except for inline tags such as b, br, a, etc. The character count must take into account that the two language nodes always have to stay together: each table must contain both language nodes of any parent node it includes. This may be clearer from the sample output XML below.

Here is a diagram of the high-level nodes of the input XML: http://dolphin-tecnologias.com/detede/diagram.png

Some comments about the diagram:

  • <ul> and <ol> have 1 or more <li> children
  • Every <title>, <p> and <li> node has an EN tag child and an ES tag child

A sample of output XML:

<om>
<table>
    <tr>
        <td class="h1">
            <title-en>Document title in english</title-en>
        </td>
        <td class="h1">
            <title-es>Título del documento en español</title-es>
        </td>
    </tr>
    <tr id="1">
        <td class="h2">
            <title-en>Section 1 title in english</title-en>
        </td>
        <td class="h2">
            <title-es>Título de la sección 1 en español</title-es>
        </td>
    </tr>
    <tr>
        <td class="p">
            <p-en>Some text 1,<br/>more text</p-en>
        </td>
        <td class="p">
            <p-es>Texto 1,<br/>más texto</p-es>
        </td>
    </tr>
</table>
<table>
    <tr>
        <td class="ul">
            <li-en>List text 1. See section <a href="2">2</a></li-en>
        </td>
        <td class="ul">
            <li-es>Texto de lista 1. Ver sección <a href="2">2</a></li-es>
        </td>
    </tr>
    <tr>
        <td class="ul">
            <li-en>List text 2</li-en>
        </td>
        <td class="ul">
            <li-es>Texto de lista 2</li-es>
        </td>
    </tr>
    <tr>
        <td class="ul">
            <li-en>List text 3</li-en>
        </td>
        <td class="ul">
            <li-es>Texto de lista 3</li-es>
        </td>
    </tr>
    <tr>
        <td class="p">
            <p-en>Some text 2.</p-en>
        </td>
        <td class="p">
            <p-es>Texto 2.</p-es>
        </td>
    </tr>
    <tr>
        <td class="p">
            <p-en>Some text 3.</p-en>
        </td>
        <td class="p">
            <p-es>Texto 3.</p-es>
        </td>
    </tr>
</table>
<table>
    <tr id="2">
        <td class="h3">
            <title-en>Section 2 title in english</title-en>
        </td>
        <td class="h3">
            <title-es>Título de la sección 2 en español</title-es>
        </td>
    </tr>
    <tr>
        <td class="p">
            <p-en>Some text 4. <b>Bold text</b></p-en>
        </td>
        <td class="p">
            <p-es>Texto 4. <b>Texto en negrita</b></p-es>
        </td>
    </tr>
    <tr>
        <td class="p">
            <p-en>Some text 5.</p-en>
        </td>
        <td class="p">
            <p-es>Texto 5.</p-es>
        </td>
    </tr>
</table>
</om>

I built it manually, so I hope I didn't make any mistakes.

As you can see, language nodes are always together in the same row, each one in a cell of that row. The class attribute of <td> indicates the type of node (header level, paragraph, list).
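
The class values in the sample output appear to follow a simple mapping, sketched here in Python rather than XSLT. This is inferred from the sample only: the name td_class is hypothetical, and the rule "header level = section number + 1" is an assumption based on section1 giving h2 and section2 giving h3.

```python
import re


def td_class(wrapper_tag, container_tag):
    """Guess the <td> class from the wrapper node and the element containing it."""
    if wrapper_tag == "title":
        # <om> title -> h1, <section1> title -> h2, <section2> title -> h3 (assumed rule).
        match = re.fullmatch(r"section(\d+)", container_tag)
        return f"h{int(match.group(1)) + 1}" if match else "h1"
    if wrapper_tag == "li":
        # List items take the name of their list: "ul" or "ol".
        return container_tag
    return wrapper_tag  # e.g. "p"


print(td_class("title", "section2"))  # h3
```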

Assuming the maximum number of characters per table is 200, and following the rules I explained at the beginning of the post, with the above sample input XML we have 3 tables of 153, 151 and 126 characters.
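
The three totals can be checked with a short Python script (again not XSLT, just a cross-check). It is a sketch under a few assumptions: the sample input is embedded with a closing </om> so that it parses, language nodes are recognised by the -en/-es suffix of their names, and the helper names row_lengths and table_totals are made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<om>
<title>
    <title-en>Document title in english</title-en>
    <title-es>Título del documento en español</title-es>
</title>
<section1 id="1">
    <title>
        <title-en>Section 1 title in english</title-en>
        <title-es>Título de la sección 1 en español</title-es>
    </title>
    <p>
        <p-en>Some text 1,<br/>more text</p-en>
        <p-es>Texto 1,<br/>más texto</p-es>
    </p>
    <ul>
        <li>
            <li-en>List text 1. See section <a href="2">2</a></li-en>
            <li-es>Texto de lista 1. Ver sección <a href="2">2</a></li-es>
        </li>
        <li>
            <li-en>List text 2</li-en>
            <li-es>Texto de lista 2</li-es>
        </li>
        <li>
            <li-en>List text 3</li-en>
            <li-es>Texto de lista 3</li-es>
        </li>
    </ul>
    <p>
        <p-en>Some text 2.</p-en>
        <p-es>Texto 2.</p-es>
    </p>
    <p>
        <p-en>Some text 3.</p-en>
        <p-es>Texto 3.</p-es>
    </p>
    <section2 id="2">
        <title>
            <title-en>Section 2 title in english</title-en>
            <title-es>Título de la sección 2 en español</title-es>
        </title>
        <p>
            <p-en>Some text 4. <b>Bold text</b></p-en>
            <p-es>Texto 4. <b>Texto en negrita</b></p-es>
        </p>
        <p>
            <p-en>Some text 5.</p-en>
            <p-es>Texto 5.</p-es>
        </p>
    </section2>
</section1>
</om>"""


def row_lengths(root):
    """Text length of each EN/ES pair (one future <tr>), in document order."""
    lengths = []
    for parent in root.iter():
        pair = [c for c in parent if c.tag.endswith(("-en", "-es"))]
        if pair:
            # itertext() includes text inside inline children such as <b> or <a>.
            lengths.append(sum(len("".join(c.itertext())) for c in pair))
    return lengths


def table_totals(lengths, limit=200):
    """Greedy packing: start a new table when the next row would exceed limit."""
    totals, total = [], 0
    for n in lengths:
        if total and total + n > limit:
            totals.append(total)
            total = 0
        total += n
    if total:
        totals.append(total)
    return totals


lengths = row_lengths(ET.fromstring(SAMPLE))
print(table_totals(lengths))  # [153, 151, 126]
```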

Solution

I tried to tackle this with XSLT 2.0 and for-each-group but had difficulty finding a grouping expression: deciding where a group ends requires the string-length of the following element, and I don't know of a way to do that in XSLT 2.0. So I looked at other options, and XQuery 3.0 with its window clause allows exactly that.

Using Saxon 9.5 PE and the XQuery

xquery version "3.0";

declare variable $size as xs:integer external := 200;

declare function local:pair($element) {
  ($element, $element/following-sibling::*[1])
};

let $start-elements := //title-en | //p-en | //li-en
let $elements := $start-elements | //title-es | //p-es | //li-es
for tumbling window $table in $start-elements
    start $start when true()
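    end $end next $enext when 
      (: Total text length if the pair starting at $enext were added.
         $start is counted on its own because the range below already
         covers its -es sibling; using local:pair($start) would count
         that sibling twice. :)
      sum(
        ($start/string-length(), 
         $elements[$start << .
                   and . << $enext]/string-length(),
         local:pair($enext)/string-length())) gt $size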
return <table>
         { for $el in $table
           return <tr>
                    {
                      for $pair in local:pair($el)
                      return <td class="{local-name($pair/..)}">{$pair}</td>
                    }
                  </tr>
         }
       </table>

with your sample input I get the result

<?xml version="1.0" encoding="UTF-8"?>
<table>
   <tr>
      <td class="title">
         <title-en>Document title in english</title-en>
      </td>
      <td class="title">
         <title-es>Título del documento en español</title-es>
      </td>
   </tr>
   <tr>
      <td class="title">
         <title-en>Section 1 title in english</title-en>
      </td>
      <td class="title">
         <title-es>Título de la sección 1 en español</title-es>
      </td>
   </tr>
   <tr>
      <td class="p">
         <p-en>Some text 1,<br/>more text</p-en>
      </td>
      <td class="p">
         <p-es>Texto 1,<br/>más texto</p-es>
      </td>
   </tr>
</table>
<table>
   <tr>
      <td class="li">
         <li-en>List text 1. See section <a href="2">2</a>
         </li-en>
      </td>
      <td class="li">
         <li-es>Texto de lista 1. Ver sección <a href="2">2</a>
         </li-es>
      </td>
   </tr>
   <tr>
      <td class="li">
         <li-en>List text 2</li-en>
      </td>
      <td class="li">
         <li-es>Texto de lista 2</li-es>
      </td>
   </tr>
   <tr>
      <td class="li">
         <li-en>List text 3</li-en>
      </td>
      <td class="li">
         <li-es>Texto de lista 3</li-es>
      </td>
   </tr>
   <tr>
      <td class="p">
         <p-en>Some text 2.</p-en>
      </td>
      <td class="p">
         <p-es>Texto 2.</p-es>
      </td>
   </tr>
   <tr>
      <td class="p">
         <p-en>Some text 3.</p-en>
      </td>
      <td class="p">
         <p-es>Texto 3.</p-es>
      </td>
   </tr>
</table>
<table>
   <tr>
      <td class="title">
         <title-en>Section 2 title in english</title-en>
      </td>
      <td class="title">
         <title-es>Título de la sección 2 en español</title-es>
      </td>
   </tr>
   <tr>
      <td class="p">
         <p-en>Some text 4. <b>Bold text</b>
         </p-en>
      </td>
      <td class="p">
         <p-es>Texto 4. <b>Texto en negrita</b>
         </p-es>
      </td>
   </tr>
   <tr>
      <td class="p">
         <p-en>Some text 5.</p-en>
      </td>
      <td class="p">
         <p-es>Texto 5.</p-es>
      </td>
   </tr>
</table>

which I think has the structure you want. Some fine-tuning is left, for instance to get the right class attributes, but first let us know whether XQuery 3.0, as provided by Saxon PE or EE or other XQuery engines, is an option for you.

Licensed under: CC-BY-SA with attribution