Apache POI HWPF - problem in convert doc file to pdf

https://stackoverflow.com/questions/3352116

02-10-2020
|

Question

I am currently working Java project with use of apache poi. Now in my project I want to convert doc file to pdf file. The conversion done successfully but I only get text in pdf not any text style or text colour. My pdf file looks like a black & white. While my doc file is coloured and have different style of text.

This is my code,

 POIFSFileSystem fs = null;  
 Document document = new Document(); 

 try {  
     System.out.println("Starting the test");  
     fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));  

     HWPFDocument doc = new HWPFDocument(fs);  
     WordExtractor we = new WordExtractor(doc);  

     OutputStream file = new FileOutputStream(new File("/document/test.pdf")); 

     PdfWriter writer = PdfWriter.getInstance(document, file);  

     Range range = doc.getRange();
     document.open();  
     writer.setPageEmpty(true);  
     document.newPage();  
     writer.setPageEmpty(true);  

     String[] paragraphs = we.getParagraphText();  
     for (int i = 0; i < paragraphs.length; i++) {  

         org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
        // CharacterRun run = pr.getCharacterRun(i);
        // run.setBold(true);
        // run.setCapitalized(true);
        // run.setItalic(true);
         paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");  
     System.out.println("Length:" + paragraphs[i].length());  
     System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());  

     // add the paragraph to the document  
     document.add(new Paragraph(paragraphs[i]));  
     }  

     System.out.println("Document testing completed");  
 } catch (Exception e) {  
     System.out.println("Exception during test");  
     e.printStackTrace();  
 } finally {  
                 // close the document  
    document.close();  
             }  
 }

please help me.

Thnx in advance.

Solution

If you look at Apache Tika, there's a good example of reading some style information from a HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.

The Tika class is https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

One thing to note about word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to a Paragraph, and other parts are done on the runs. Depending on what formatting interests you, it may therefore be on the paragraph or the run.

OTHER TIPS

If you use WordExtractor, you will get text only. Try using CharacterRun class. You will get style along with text. Please refer following Sample code.

Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
    int j = 0;
    while (true) {
        CharacterRun run = poiPara.getCharacterRun(j++);
        System.out.println("Color "+run.getColor());
        System.out.println("Font size "+run.getFontSize());
        System.out.println("Font Name "+run.getFontName());
        System.out.println(run.isBold()+" "+run.isItalic()+" "+run.getUnderlineCode());
        System.out.println("Text is "+run.text());
        if (run.getEndOffset() == poiPara.getEndOffset()) {
            break;
        }
    }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow