Question

When I try to convert a UTF-8 encoded markdown file to pdf using Docverter (through the API), all non-ASCII characters are lost.

Any solution to that?

I want to convert .md -> .pdf. Maybe Docverter can help to do .md -> .html, then I could use some other library / service for the .html -> .pdf?

Solution

Update (14/10/2013)

The problem with boolean options in Docverter is now solved, so you can now do a direct conversion from md to pdf, passing the option ascii=true to Docverter. This causes the intermediate HTML to use entities instead of utf-8, and thus the resulting pdf is OK.

Original answer

After a lot of research (I had this same problem), I discovered that the bug is in the html -> pdf conversion performed by Docverter, which uses the Flying Saucer libraries. This conversion ignores any non-ASCII char in the HTML input, even if the charset is correctly set to utf-8 in the meta tags.

However, if the HTML contains entities such as &oacute;, then Flying Saucer does include those characters, and assuming a font which has the correct encoding (the default fonts used by the library are fine), the proper char (ó in this example) is shown in the resulting pdf.
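To illustrate what entity-encoded text looks like, here is a minimal Python sketch. The helper name to_entities is hypothetical and only for illustration. Note that Python's "xmlcharrefreplace" error handler emits numeric character references (such as &#243;) rather than named entities (such as &oacute;); both stand for the same character.

```python
def to_entities(text):
    """Return `text` with every non-ASCII char replaced by a numeric entity.

    `to_entities` is a hypothetical helper name used only for this sketch.
    """
    # Encode to ASCII bytes, substituting &#NNN; for anything outside ASCII,
    # then decode back so the result is an ordinary (pure ASCII) string.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = u"canci\u00f3n"  # the word "canción", with ó written as an escape
print(to_entities(sample))
```

Plain ASCII input passes through unchanged, so applying this to a whole HTML document only touches the non-ASCII characters.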

So I ended up with the following approach:

  1. Use Docverter to convert .md -> html
  2. Process the resulting html to use HTML entities instead of utf-8
  3. Use Docverter again to convert .html -> .pdf

Step 2 is easy if you happen to use Python. In this case, the following lines do the trick:

import io

def fixHTML(filename):
    # Read the file and decode it from UTF-8 into a unicode string
    with io.open(filename, "r", encoding="utf-8") as f:
        content = f.read()
    # Rewrite the file as ASCII, replacing each non-ASCII char with a numeric entity
    with io.open(filename, "wb") as f:
        f.write(content.encode("ascii", "xmlcharrefreplace"))

Note: This convoluted approach should not be required, because pandoc accepts the switch --ascii, which forces it to produce HTML like that obtained in step 2. However, Docverter's parser for boolean options seems to be broken, so it is not possible to pass the option ascii to Docverter.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow