Question

I am using Apache POI.

I am able to read the text of a .doc file using "org.apache.poi.hwpf.extractor.WordExtractor", and I have also fetched the tables using "org.apache.poi.hwpf.usermodel.Table".

How can I fetch the bold/italic formatting of the text?

Thanks in advance.

Solution

WordExtractor returns only the text, nothing else.

The simplest way to get both the text and the formatting of a Word document is to switch to Apache Tika. Apache Tika builds on top of Apache POI (amongst others), and offers both plain text extraction and rich extraction (XHTML with formatting).

Alternatively, if you want to write the code yourself, I'd suggest you review the code in Tika's WordExtractor, which demonstrates how to use Apache POI to get the formatting information out of runs of text.

OTHER TIPS

Instead of using WordExtractor, you can read with Range:

...
HWPFDocument doc = new HWPFDocument(fis);
Range r = doc.getRange();
...

Range is the central class of that model. Once you have the Range, you can work with the properties of the text: for instance, iterate through all of its CharacterRuns and check whether each one is italic (.isItalic()) or make it italic (.setItalic(true)).

for (int i = 0; i < r.numCharacterRuns(); i++) {
    CharacterRun cr = r.getCharacterRun(i);
    // cr.isBold() and cr.isItalic() report the run's current formatting
    cr.setItalic(true); // make this run italic
    ...
}

...
File fon = new File(yourFilePathOut);
FileOutputStream fos = new FileOutputStream(fon);
doc.write(fos); 
...

This works if you stick with HWPF. By the way, it is usually more convenient to structure the code around the concept of a Paragraph.
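To read the formatting rather than change it, one option is to copy each run's text and flags into a small holder and process those. The sketch below deliberately has no POI dependency so that it runs on its own: StyledRun, RunMarkup, mergeAdjacent and toMarkup are hypothetical names, and in real code the three values would presumably come from cr.text(), cr.isBold() and cr.isItalic() inside the loop above. It merges neighbouring runs that share the same flags (Word tends to split visually uniform text across several runs) and wraps the formatted spans in simple tags. The text is not escaped, so treat the output as illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

final class RunMarkup {

    /** Hypothetical holder for the values read from one CharacterRun. */
    static final class StyledRun {
        final String text;
        final boolean bold;
        final boolean italic;

        StyledRun(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Merges neighbouring runs that have identical bold/italic flags. */
    static List<StyledRun> mergeAdjacent(List<StyledRun> runs) {
        List<StyledRun> merged = new ArrayList<>();
        for (StyledRun run : runs) {
            if (run.text.isEmpty()) {
                continue; // skip empty runs
            }
            int last = merged.size() - 1;
            if (last >= 0
                    && merged.get(last).bold == run.bold
                    && merged.get(last).italic == run.italic) {
                StyledRun prev = merged.get(last);
                merged.set(last, new StyledRun(prev.text + run.text, run.bold, run.italic));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    /** Renders the runs as text with <b>/<i> tags around formatted spans. */
    static String toMarkup(List<StyledRun> runs) {
        StringBuilder out = new StringBuilder();
        for (StyledRun run : mergeAdjacent(runs)) {
            if (run.bold) out.append("<b>");
            if (run.italic) out.append("<i>");
            out.append(run.text);
            if (run.italic) out.append("</i>");
            if (run.bold) out.append("</b>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; with POI these would be built from the CharacterRuns.
        List<StyledRun> runs = new ArrayList<>();
        runs.add(new StyledRun("Plain ", false, false));
        runs.add(new StyledRun("bo", true, false));
        runs.add(new StyledRun("ld", true, false));
        runs.add(new StyledRun(" and ", false, false));
        runs.add(new StyledRun("both", true, true));
        System.out.println(toMarkup(runs));
        // prints: Plain <b>bold</b> and <b><i>both</i></b>
    }
}
```

Keeping the holder separate from POI means the formatting logic can be checked without a .doc file at hand; whether that separation is worth it depends on how much you do with the runs afterwards.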

Licensed under: CC-BY-SA with attribution