Question

I am attempting to convert DOCX to DITA topics through an intermediate HTML step.

Simple substitutions in sed, Emacs, or vi handle most of the changes, but not all of them; for the rest I may need Perl or Python. Below is an example of what I am trying to accomplish:

From:

<h1> Head 1 </h1>
  <body> 
  </body>


 <h2>Sub Head 1 </h2>
  <body>
  </body>


  <h3>SubSub Head 1 </h3>
   <body> 
   </body>

 <h2>Sub Head 2 </h2>
 <body> 
 </body>

<h1>Head 2 </h1>
<body> 
</body>

To:

<topic><title> Head 1 </title>
  <body> 
  </body>

 <topic><title> Sub Head 1 </title>
  <body>
  </body>

  <topic><title> SubSub Head 1 </title>
   <body> 
   </body>
  </topic>
 </topic>

 <topic><title> Sub Head 2 </title>
 <body> 
 </body>
 </topic>
</topic>

<topic><title> Head 2 </title>
<body> 
</body>
</topic>

The part I have trouble with is placing the closing tags for nested topics (and yes, I do have nested topics; my needs are somewhat unusual because I am migrating existing documents). If someone can suggest a Perl snippet, or point me to a similar one, that places the tags on a per-tag basis, I can build my script around it.

Thanks in advance for looking and for any suggestions.

No correct solution

OTHER TIPS

That's the kind of processing I often use XML::Twig for.

The wrap_children method is designed just for this: it lets you define a regexp-like expression over an element's children, and each run of children that matches is wrapped in a new element. See the example below and the docs for more:

#!/usr/bin/perl

use strict;
use warnings;

use Test::More tests => 1;

use XML::Twig;

# reads the DATA section, the input doc first, then the expected result
my( $in, $expected)= do{ local $/="\n\n"; <DATA>}; 

my $t=XML::Twig->new->parse( $in);
my $root= $t->root;

# that's where the wrapping occurs, from the inside out
$root->wrap_children( '<h3><body>',                   topic => { level => 3 });
$root->wrap_children( '<h2><body><topic level="3">*', topic => { level => 2 });
$root->wrap_children( '<h1><body><topic level="2">*', topic => { level => 1 });

# now we clean up: the levels are not used any more
foreach my $to ($t->descendants( 'topic'))
  { $to->del_att( 'level'); }

# the wrapping will have generated tons of additional id's, 
# you may not need this if your elements had id's before the wrapping
foreach my $to ($t->descendants( 'topic|body|h1|h2|h3'))  
  { $to->del_att( 'id'); }

# now we can deal with titles
foreach my $h  ($t->descendants( 'h1|h2|h3')) { $h->set_tag( 'title'); }

# how did we do?
is( $t->sprint( pretty_print => 'indented'), $expected, 'just one test');

__DATA__
<doc>
  <h1> Head 1 </h1>
    <body></body>
  <h2> Sub Head 1 </h2>
    <body></body>
  <h3> SubSub Head 1 </h3>
    <body></body>
  <h2> Sub Head 2 </h2>
    <body></body>
  <h1> Head 2 </h1>
    <body></body>
</doc>

<doc>
  <topic>
    <title> Head 1 </title>
    <body></body>
    <topic>
      <title> Sub Head 1 </title>
      <body></body>
      <topic>
        <title> SubSub Head 1 </title>
        <body></body>
      </topic>
    </topic>
    <topic>
      <title> Sub Head 2 </title>
      <body></body>
    </topic>
  </topic>
  <topic>
    <title> Head 2 </title>
    <body></body>
  </topic>
</doc>
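
If installing XML::Twig is not an option, the nesting can also be approximated with core Perl and a stack of open heading levels. This is only a rough sketch and not the XML::Twig approach above: nest_topics is a hypothetical helper name, and it assumes every heading sits on a single line, that only h1 to h6 are used, and that the body markup can be copied through unchanged. It does no real HTML parsing, so it will break on headings that span lines or carry attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# nest_topics is a hypothetical helper, not part of any module.
# Assumption: each <hN>...</hN> heading occupies exactly one line.
sub nest_topics {
    my ($html) = @_;
    my @open;        # levels of the topics that are currently open
    my $out = '';

    foreach my $line (split /^/, $html) {
        # /i tolerates mixed-case tags, which converters sometimes emit
        if ($line =~ m{^\s*<h([1-6])>(.*?)</h\1>\s*$}i) {
            my ($level, $title) = ($1, $2);

            # a new heading closes every open topic at the same or a deeper level
            while (@open && $open[-1] >= $level) {
                $out .= "</topic>\n";
                pop @open;
            }
            $out .= "<topic><title>$title</title>\n";
            push @open, $level;
        }
        else {
            # anything that is not a heading is passed through untouched
            $out .= $line;
        }
    }

    # close whatever is still open at the end of the input
    $out .= "</topic>\n" for @open;
    return $out;
}

my $html = <<'END';
<h1>Head 1</h1>
<body></body>
<h2>Sub Head 1</h2>
<body></body>
<h3>SubSub Head 1</h3>
<body></body>
<h2>Sub Head 2</h2>
<body></body>
<h1>Head 2</h1>
<body></body>
END

print nest_topics($html);
```

The stack is what decides how many closing tags to emit: an h2 that follows an h3 closes two topics, while an h2 that follows an h1 closes none. For anything beyond quick migration scripts, a real parser such as XML::Twig is the safer choice.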
Licensed under: CC-BY-SA with attribution