Question

My documentation management app involves converting a .docx file containing non-ASCII Unicode characters (Japanese) to PDF with docsplit (via the Ruby gem, if it matters). It works fine on my Mac. On my Ubuntu machine, the resulting PDF has square boxes where the characters should be, whether docsplit is invoked through Ruby or directly on the command line. The odd thing is that when I open the .docx file directly in LibreOffice and do a PDF export, it works fine. So it would seem there is some aspect of how docsplit invokes LO that causes the Unicode characters to be handled improperly. I have scoured the documentation and code for options I might need to specify, with no luck. Any ideas why this could be happening?

FWIW, docsplit invokes LO with the following options line in pdf_extractor.rb:

options = "--headless --invisible --norestore --nolockcheck --convert-to pdf --outdir #{escaped_out} #{escaped_doc}"
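
For context, the gem-level call that reaches this code path looks roughly like the following; the file and directory names are placeholders, and Docsplit.extract_pdf is the gem's documented entry point for PDF conversion.

require 'docsplit'

# Converts the .docx to PDF via LibreOffice; :output names the
# destination directory for the generated PDF.
Docsplit.extract_pdf('report_ja.docx', output: 'pdfs')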

I notice that the output format can optionally be followed by an output filter, as in pdf:output_filter_name. Is this something I need to think about using?
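
For reference, the filter name goes after a colon. A hypothetical variant of the options line above, using writer_pdf_Export (LibreOffice's standard Writer-to-PDF export filter), would look like:

# Hypothetical: same flags, plus an explicit PDF export filter after the colon.
options = "--headless --invisible --norestore --nolockcheck --convert-to pdf:writer_pdf_Export --outdir #{escaped_out} #{escaped_doc}"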


OTHER TIPS

I have tracked this down to the --headless option that docsplit passes to LibreOffice. That flag invokes a non-X version of LO, which apparently does not have the necessary Japanese fonts. Unfortunately, docsplit exposes no option to omit the --headless flag, so I will end up patching or forking the code somehow.
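
Until then, one workaround is to skip docsplit for the conversion step and shell out to LibreOffice directly with the same flags minus --headless. A minimal sketch, assuming soffice is on the PATH and an X session is available:

require 'shellwords'

# Rebuild docsplit's command line without --headless so the X-enabled
# LibreOffice binary (and its font configuration) performs the conversion.
def convert_to_pdf(doc, out_dir)
  cmd = ['soffice', '--invisible', '--norestore', '--nolockcheck',
         '--convert-to', 'pdf', '--outdir', out_dir, doc].shelljoin
  system(cmd) or raise "LibreOffice conversion failed for #{doc}"
end

convert_to_pdf('report_ja.docx', 'pdfs')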
