Question

Superficially, an easy question: how do I get a great-looking PDF from my XML document? Actually, my input is a subset of XHTML with a few custom attributes added (to save some information on citation sources, etc). I've been exploring some routes and would like to get some feedback if anyone has tried some of this before.

Note: I've considered XSL-FO to generate PDFs but have heard that the typographic quality of the open source tools still lags well behind TeX. I guess the most advanced one is Apache FOP. But I'm really interested in great-looking PDFs (otherwise I could just use the print dialog of my browser). Any thoughts or updates on this?

So I've been thinking of using XSLT to convert my customized XML/XHTML dialect to DocBook and go from there (DocBook via XSLT to proper HTML seems to work quite well, so I might use it for that as well). But how do I go from DocBook to TeX? I've come across a number of solutions.

  • dblatex: A set of XSLT stylesheets that output LaTeX.
  • db2latex: Started as a clone of dblatex but now provides tighter integration with LaTeX packages and a single script to output PDF, which is quite nice.
  • PassiveTeX: Instead of XSLT it uses an XML parser written in TeX.
  • TeXML: Essentially an XML serialization of the LaTeX language that can be used as an intermediate format, plus an accompanying Python tool that transforms that XML format to LaTeX/ConTeXt. They claimed that this avoids the existing solutions' problems with special symbols, lost braces or spaces, and support for only Latin-1 encoding. (Is this still the case?)

As my input XML might contain quite a few special characters represented in Unicode, the last point is especially important to me. I've also been thinking of using XeTeX instead of pdfTeX to get around this problem. (I might lose some typographic quality, but maybe it's still better than current open source XSL-FO processors?) So db2latex and TeXML seem to be the favorites. Can anybody comment on the robustness of those?
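To make the "special symbols" concern concrete, here is a minimal Python sketch of the escaping step that any XML-to-LaTeX converter needs. It assumes a Unicode-aware engine such as XeTeX or LuaTeX with a font that covers the glyphs, so non-ASCII characters are passed through untouched and only LaTeX's ten special ASCII characters are rewritten. The function name `escape_latex` is made up for this illustration and is not taken from any of the tools above.

```python
import re

# LaTeX's special ASCII characters and one common way to write each literally.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "&": r"\&",
    "#": r"\#",
    "%": r"\%",
    "_": r"\_",
    "^": r"\textasciicircum{}",
    "~": r"\textasciitilde{}",
}

# One alternation over all specials, so the text is scanned exactly once.
_SPECIALS_RE = re.compile("|".join(re.escape(ch) for ch in LATEX_SPECIALS))


def escape_latex(text: str) -> str:
    """Escape LaTeX specials in a single pass.

    A single pass matters: chained str.replace calls would re-escape the
    backslashes and braces introduced by earlier replacements.
    Non-ASCII characters are returned unchanged (assumes XeTeX/LuaTeX).
    """
    return _SPECIALS_RE.sub(lambda m: LATEX_SPECIALS[m.group(0)], text)
```

Doing this in one pass is the design choice that avoids the "losing some braces" class of bugs: each input character is looked at once, so the output of one substitution is never fed into another.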

Alternatively, I might have more luck using ConTeXt directly, as there seems to be quite some interest in the ConTeXt community in XML. Especially, I might take a deeper look at "My Way: Getting Web Content and pdf-Output from One Source" and "Dealing with XML in ConTeXt MkIV". Both documents describe an approach using ConTeXt combined with LuaTeX. (DocBook In ConTeXt seems to do about the same but the latest version is from 2003.) The second document notes:

"You may wonder why we do these manipulations in TeX and not use xslt instead. The advantage of an integrated approach is that it simplifies usage. Think of not only processing a document, but also using xml for managing resources in the same run. An xslt approach is just as verbose (after all, you still need to produce TeX code) and probably less readable. In the case of MkIV the integrated approach is also faster and gives us the option to manipulate content at runtime using Lua."

What do you think about this? Please keep in mind that I have some experience with both XSLT and TeX but have never gone terribly deep into either of them. I've never tried many different LaTeX packages or alternatives such as ConTeXt (or XeTeX/LuaTeX instead of pdfTeX), but I am willing to learn some new stuff to get my beautiful PDFs in the end ;)

Also, I stumbled over Pandoc but couldn't find any info on how it compares to the other mentioned approaches. And lastly, a link to some quite extensive documentation on how to use TeXML with ConTeXt.

Solution 3

In the end, I've decided to go with Pandoc, which seems to be a very polished and solid code base. One potential drawback is that you have to limit yourself to the set of markup features available in Pandoc's internal representation, which maps basically one-to-one to its extended markdown.

Because I didn't think generating markdown from my XHTML-like source was a good idea, I got a Pandoc component started that reads DocBook, which is currently in the master branch of Pandoc's development repo. So now I have a simple XSLT stylesheet that converts from my XHTML dialect to DocBook (which is also XML), and then I use Pandoc to export to a host of other formats, including PDF via ConTeXt.
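A rough sketch of how that two-step pipeline could be scripted in Python. It assumes `xsltproc`, `pandoc` and a ConTeXt installation are on the PATH; all file names and the helper function names are placeholders, and the actual stylesheet is not shown in this answer.

```python
import subprocess


def xslt_command(stylesheet: str, source: str, docbook_file: str) -> list[str]:
    """Build the xsltproc call that turns the XHTML dialect into DocBook."""
    return ["xsltproc", "-o", docbook_file, stylesheet, source]


def pandoc_command(docbook_file: str, pdf_file: str) -> list[str]:
    """Build the pandoc call: read DocBook, write ConTeXt, compile to PDF.

    With a .pdf output file and --to=context, pandoc hands the
    intermediate ConTeXt file to the context program itself.
    """
    return [
        "pandoc",
        "--from=docbook",
        "--to=context",
        f"--output={pdf_file}",
        docbook_file,
    ]


def build_pdf(stylesheet: str, source: str, docbook_file: str, pdf_file: str) -> None:
    """Run both steps; check=True raises CalledProcessError if a tool fails."""
    subprocess.run(xslt_command(stylesheet, source, docbook_file), check=True)
    subprocess.run(pandoc_command(docbook_file, pdf_file), check=True)
```

A call such as `build_pdf("xhtml2docbook.xsl", "article.xhtml", "article.xml", "article.pdf")` would run the whole chain. Changing `--to` and the output file extension is all it takes to target one of Pandoc's other output formats from the same DocBook file.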

OTHER TIPS

I've done something like this in the past (that is, maintaining master versions of documents in XML, and wanting to produce LaTeX output from them).

I've used PassiveTeX in the past, but I found creating stylesheets to be hard work -- the usual result of writing two languages at once. I got it to work, and the result looked very good, but it was probably more effort than it was worth. That said, if the amount of styling you need to add is small, then this might be a good route, because it's a single step.

The most successful route (read: flexible and attractive) was to use XSLT to transform the document into structural LaTeX, which matches the intended structure of the result document but doesn't attempt to do more than minimal formatting. Depending on your document, that might be normal-looking LaTeX, or it might have bespoke structures. Then write or adapt a LaTeX stylesheet or class file which formats that output into something attractive. That way, you're using XSLT to its strengths (and not going beyond them, which rapidly becomes very frustrating), using LaTeX to its strengths, and not confusing yourself.
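To illustrate what "structural LaTeX" means here, the following sketch does the transformation in Python with the standard library's ElementTree rather than XSLT (a swapped-in technique, chosen only to keep the example self-contained). The element names, the `source` attribute and the `\sourceref` macro are hypothetical; the point is that the converter emits only structure, and a separate class file would define how `\section`, `\emph` or `\sourceref` actually look.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XHTML-like elements to structural LaTeX:
# tag -> (text emitted before the content, text emitted after it).
WRAPPERS = {
    "h1": ("\\section{", "}\n\n"),
    "h2": ("\\subsection{", "}\n\n"),
    "p": ("", "\n\n"),
    "em": ("\\emph{", "}"),
    "strong": ("\\textbf{", "}"),
}

# Per-character escaping of LaTeX's special ASCII characters.
ESCAPES = str.maketrans({
    "\\": "\\textbackslash{}", "{": "\\{", "}": "\\}", "$": "\\$",
    "&": "\\&", "#": "\\#", "%": "\\%", "_": "\\_",
    "^": "\\textasciicircum{}", "~": "\\textasciitilde{}",
})


def to_latex(node: ET.Element) -> str:
    """Recursively emit structural LaTeX for one element and its children."""
    if node.tag == "cite":
        # Bespoke structure: a made-up \sourceref{key}{text} macro that the
        # class file would have to define. The key is emitted unescaped.
        before, after = "\\sourceref{" + node.get("source", "") + "}{", "}"
    else:
        # Unknown elements contribute their content without any wrapper.
        before, after = WRAPPERS.get(node.tag, ("", ""))
    parts = [before, (node.text or "").translate(ESCAPES)]
    for child in node:
        parts.append(to_latex(child))
        # Text that follows a child element belongs to the parent.
        parts.append((child.tail or "").translate(ESCAPES))
    parts.append(after)
    return "".join(parts)
```

For example, `to_latex(ET.fromstring("<p>Up <em>50%</em> now.</p>"))` yields a paragraph containing `\emph{50\%}`. All decisions about fonts, spacing and layout stay on the LaTeX side, which is the division of labour described above.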

That is, this more-or-less matches the approach of your first two alternatives, and whether you go with them, or write/customise a LaTeX stylesheet with bespoke output, is a function of how comfortable you feel with LaTeX stylesheets, and how much complicated or specialised formatting you need to do.

Since you say you need to handle Unicode characters in the input, then yes, XeLaTeX would be a good choice for the LaTeX part of the pipeline.

You might want to check questions tagged with XML on TeX.sx, especially this one. I suggest you use ConTeXt; the current version has no problems with Unicode and can handle OpenType perfectly - and it's programmable in Lua. The most often used alternative with LaTeX is XMLTeX, but that needs a lot of TeX foo.

If your documents can be handled by pandoc, use that: You'll have multiple output options, more than from any TeX-based system.

If you want more options on how to customize your TeX output, I would suggest using this:

xml2tex

It's based on a declarative configuration where you can specify your mapping from XML to TeX. MathML and XML tables (HTML and CALS) are automatically converted to TeX. It's also open source and provides ready-to-use configurations for DocBook and DITA.

Licensed under: CC-BY-SA with attribution