Converting Docx to image using Docx4j and PdfBox causes OutOfMemoryError

https://stackoverflow.com/questions/12850753

06-07-2021

Question

I'm converting the first page of a docx file to an image in two steps using docx4j and pdfbox, but I currently get an OutOfMemoryError every time.

I've determined that the exception is thrown in the very last step of this process, while the convertToImage method is being called. However, I've been using that second step to convert PDFs for some time without issue, so I'm at a loss as to the cause, unless docx4j is encoding the PDF in a way I have not yet tested, or the PDF is corrupt.

I've tried replacing the ByteArrayOutputStream with a FileOutputStream; the PDF seems to render correctly and is not any larger than I would expect.

This is the code I am using:

WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
org.docx4j.convert.out.pdf.PdfConversion c = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);

((org.docx4j.convert.out.pdf.viaXSLFO.Conversion)c).setSaveFO(File.createTempFile("fonts", ".fo"));
ByteArrayOutputStream os = new ByteArrayOutputStream();
c.output(os, new PdfSettings());

byte[] bytes = os.toByteArray();
os.close();

ByteArrayInputStream is = new ByteArrayInputStream(bytes);

PDDocument document = PDDocument.load(is);

PDPage page = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);

is.close();
document.close();
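
As a rough sanity check on whether the rendered image itself could exhaust the heap, the sketch below estimates the raster size of one page at a given resolution. It assumes a US Letter page (612 x 792 PDF points) and one 32-bit int per pixel for TYPE_INT_RGB. The class and method names are made up for illustration and are not part of pdfbox, whose internal buffers may differ.

```java
class PageRasterEstimate {
    // PDF user space uses 72 points per inch.
    static final double POINTS_PER_INCH = 72.0;

    /** Pixel length of a page side given in PDF points, rendered at the given DPI. */
    static int pixels(double points, int dpi) {
        return (int) Math.round(points / POINTS_PER_INCH * dpi);
    }

    /** Approximate bytes for a TYPE_INT_RGB raster: one 32-bit int per pixel. */
    static long estimatedBytes(double widthPt, double heightPt, int dpi) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi) * Integer.BYTES;
    }

    public static void main(String[] args) {
        // Assumption: US Letter page rendered at 96 DPI, as in the question.
        long bytes = estimatedBytes(612, 792, 96);
        System.out.println(pixels(612, 96) + " x " + pixels(792, 96)
                + " px, " + bytes + " bytes");
    }
}
```

Under these assumptions the page is 816 x 1056 pixels, or roughly 3.3 MiB of pixel data, which on its own is small compared with a typical heap.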

Edit: To give more context, this code runs in a Grails web application. I have tried several variants of it, including nulling out every reference once it is no longer needed, and using FileInputStream and FileOutputStream to try to conserve memory and to inspect the output of docx4j and pdfbox, each of which seems to work correctly.
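
The file-based variant described above can be sketched with the standard library alone. The `PdfWriter` and `PdfReader` interfaces below are hypothetical stand-ins for the step that writes the PDF (for example the `c.output(os, ...)` call) and the step that loads it again; they are not docx4j or pdfbox types. The point is only that the PDF bytes go through a temporary file instead of being held in a byte array and copied again.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class TempFileHandoff {
    /** Hypothetical stand-in for the step that writes the PDF to a stream. */
    interface PdfWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Hypothetical stand-in for the step that reads the PDF back in. */
    interface PdfReader<T> {
        T readFrom(InputStream in) throws IOException;
    }

    static <T> T convertViaTempFile(PdfWriter writer, PdfReader<T> reader) throws IOException {
        Path tmp = Files.createTempFile("converted", ".pdf");
        try {
            // Stage 1: stream the converter output straight to disk.
            try (OutputStream out = Files.newOutputStream(tmp)) {
                writer.writeTo(out);
            }
            // Stage 2: stream it back in; the stream is closed even on failure.
            try (InputStream in = Files.newInputStream(tmp)) {
                return reader.readFrom(in);
            }
        } finally {
            // Always remove the temporary file, whether or not a stage failed.
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because both stages use try-with-resources, the streams are released even when a stage throws, which matters in a long-running web application.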

I'm using docx4j 2.8.1 and pdfbox 0.7.3. I have also tried pdf-renderer, but I still get an OutOfMemoryError. My suspicion is that docx4j uses too much memory, but the error does not surface until the PDF-to-image conversion.
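
One way to test that suspicion is to log heap usage around each stage. The following is a minimal sketch using only `java.lang.Runtime`; the `measure` helper and the stage label are illustrative, and the byte-array allocation is a placeholder for the real docx4j or pdfbox call. Figures from `Runtime` are approximate because garbage collection can run at any time.

```java
import java.util.function.Supplier;

class HeapProbe {
    /** Bytes of heap currently in use, according to the JVM's own accounting. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs one stage and prints the approximate change in used heap and the heap ceiling. */
    static <T> T measure(String label, Supplier<T> stage) {
        long before = usedHeap();
        T result = stage.get();
        long after = usedHeap();
        System.out.printf("%s: used heap changed by %d KiB (max heap %d MiB)%n",
                label, (after - before) / 1024,
                Runtime.getRuntime().maxMemory() / (1024 * 1024));
        return result;
    }

    public static void main(String[] args) {
        // Placeholder stage: in the real code this would wrap the docx-to-PDF
        // call and, separately, the PDF-to-image call.
        byte[] pdfBytes = measure("docx -> pdf (placeholder)", () -> new byte[8 * 1024 * 1024]);
        System.out.println("stage produced " + pdfBytes.length + " bytes");
    }
}
```

If the first stage already pushes usage close to the maximum, the failure in the later stage would be a symptom rather than the cause. In that case, raising the limit with the standard `-Xmx` JVM option is one thing to try.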

I would gladly accept an alternative way of converting a docx file to a PDF, or directly to an image, as an answer. I am currently trying to replace jodconverter, which has been problematic to run on a server.

Solution

I'm part of the XDocReport team.

We recently developed a small web app, deployed on CloudBees (http://xdocreport-converter.opensagres.cloudbees.net/), that demonstrates the behaviour of the converters.

You can easily compare the behaviour and performance of docx4j and XDocReport for PDF and HTML conversion.

The source code can be found here:

https://github.com/pascalleclercq/xdocreport-demo (REST-Service-Converter-WebApplication subfolder)

and here:

https://github.com/pascalleclercq/xdocreport/blob/master/remoting/fr.opensagres.xdocreport.remoting.converter.server/src/main/java/fr/opensagres/xdocreport/remoting/converter/server/ConverterResourceImpl.java

The first numbers I get suggest that XDocReport is roughly 10 times faster than docx4j at generating a PDF.

Feedback is welcome.

Other tips

Glorious success at last! I replaced docx4j with XDocReport and the document now converts to a PDF in no time at all. However, some documents still have issues. I expect this is due to the OS they were created on, and it may be solved by using:

PDFViaITextOptions options = PDFViaITextOptions.create().fontEncoding("windows-1250");

which names the encoding of the appropriate OS, instead of just:

PDFViaITextOptions options = PDFViaITextOptions.create();

which defaults to the encoding of the current OS.
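
To see why the default can differ between a developer machine and a server, the sketch below compares the JVM's platform default charset with an explicitly requested one. It uses only `java.nio.charset.Charset`; the `resolve` helper is hypothetical and not part of XDocReport, and it falls back to the platform default when the requested name is not available on the running JVM.

```java
import java.nio.charset.Charset;

class EncodingChoice {
    /**
     * Returns the canonical name of the requested charset if this JVM supports it,
     * otherwise the platform default. Syntactically illegal names make
     * Charset.isSupported throw IllegalCharsetNameException.
     */
    static String resolve(String requested) {
        if (requested != null && Charset.isSupported(requested)) {
            return Charset.forName(requested).name();
        }
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("platform default: " + Charset.defaultCharset().name());
        System.out.println("requested:        " + resolve("windows-1250"));
    }
}
```

Passing a fixed name keeps the conversion independent of the machine that happens to run it.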

This is the code I now use to convert from DOCX to PDF:

// Step 1: load the DOCX and convert it to PDF in memory with XDocReport.
FileInputStream in = new FileInputStream(file);
XWPFDocument docx = new XWPFDocument(in);

PDFViaITextOptions options = PDFViaITextOptions.create();

ByteArrayOutputStream out = new ByteArrayOutputStream();
XWPF2PDFViaITextConverter.getInstance().convert(docx, out, options);

byte[] bytes = out.toByteArray();
out.close();
in.close();

// Step 2: load the PDF with pdfbox and render the first page at 96 DPI.
// The PDF variable is named "pdf" so it no longer clashes with the DOCX document above.
ByteArrayInputStream is = new ByteArrayInputStream(bytes);
PDDocument pdf = PDDocument.load(is);

PDPage page = (PDPage) pdf.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 96);

is.close();
pdf.close();

return image;
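
The returned BufferedImage still has to be encoded before a web application can serve it. A minimal sketch using the standard `javax.imageio.ImageIO` API follows; the `PagePreviewEncoder` class is illustrative only, and the blank 816 x 1056 image stands in for the page rendered by the conversion code.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

class PagePreviewEncoder {
    /** Encodes a rendered page as PNG bytes, e.g. for writing to an HTTP response. */
    static byte[] toPng(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ImageIO.write returns false when no writer for the format is registered.
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the image returned by the conversion code.
        BufferedImage page = new BufferedImage(816, 1056, BufferedImage.TYPE_INT_RGB);
        System.out.println("PNG size in bytes: " + toPng(page).length);
    }
}
```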

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow