Parsing content from Word document using docx4j

https://stackoverflow.com/questions/12335332

01-07-2021

Question

Thanks to a previous answer, I'm now able to read my password-protected Word 2010 documents. (I have to convert them one by one from .doc to .docx. They go back to 1994, but that's okay.)

I wrote a simple Java class to get started:

package model.docx4j;

import model.JournalEntry;
import model.JournalEntryFactory;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.OpcPackage;
import org.docx4j.openpackaging.parts.Parts;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.util.LinkedList;
import java.util.List;

/**
 * JournalEntryFactoryImpl using docx4j
 * @author Michael
 * @since 9/8/12 12:44 PM
 */
public class JournalEntryFactoryImpl implements JournalEntryFactory {
    // Logger declaration (SLF4J assumed; any logger with error(String, Throwable) works)
    private static final Logger LOGGER = LoggerFactory.getLogger(JournalEntryFactoryImpl.class);

    @Override
    public List<JournalEntry> getEntries(InputStream inputStream, String password) throws IOException, GeneralSecurityException {
        List<JournalEntry> journalEntries = new LinkedList<JournalEntry>();
        if (inputStream != null) {
            try {
                // Decrypt and load the package, then list its parts
                OpcPackage opcPackage = OpcPackage.load(inputStream, password);
                // Not used yet: this is where I expected to find the content
                Parts parts = opcPackage.getParts();
            } catch (Docx4JException e) {
                LOGGER.error("Could not load document into docx4j", e);
                throw new IOException(e);
            }
        }
        return journalEntries;
    }
}

I also wrote a JUnit test to drive it.

I put a breakpoint into the test to see what docx4j was doing once it read my document. I see a list of 8 parts, but I walked through the tree without finding the content.

Each document consists of a page with a date and content, but I can't find pages. Where do they live?

Solution

The main document content lives in the "main document part", which is often named "/word/document.xml".

The usual way to get it with docx4j is:

WordprocessingMLPackage wordMLPackage = (WordprocessingMLPackage)opcPackage;
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

but your approach should work as well, since the main document part is also one of the parts returned by getParts().
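To see where the content physically lives, it can help to look at the package without docx4j at all. The following is only a rough sketch using the Java standard library, not the docx4j way of doing it. It assumes an unencrypted .docx (a password-protected file is an encrypted container rather than a plain ZIP, so docx4j still has to decrypt it first) and it assumes the main part has its usual name, word/document.xml, instead of resolving it through the package relationships. The class name DocxParagraphSketch, the helper tinyDocx, and the sample text are all made up for illustration. Note that Word stores paragraphs (w:p), not pages: page breaks are computed at layout time, which is why no "page" objects turn up in the part tree.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxParagraphSketch {
    // WordprocessingML namespace of the w:p (paragraph) and w:t (text) elements.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Assumption: the main document part has its usual name inside the ZIP.
    static final String MAIN_PART = "word/document.xml";

    /** Returns the text of each w:p element found in word/document.xml. */
    public static List<String> readParagraphs(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (MAIN_PART.equals(e.getName())) {
                    return paragraphsOf(zip);
                }
            }
        }
        throw new IOException("No " + MAIN_PART + " entry found");
    }

    private static List<String> paragraphsOf(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(xml);

        List<String> paragraphs = new ArrayList<>();
        NodeList ps = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < ps.getLength(); i++) {
            // Simplification: concatenates every w:t below the paragraph,
            // so paragraphs nested in text boxes are counted twice.
            NodeList ts = ((Element) ps.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < ts.getLength(); j++) {
                text.append(ts.item(j).getTextContent());
            }
            paragraphs.add(text.toString());
        }
        return paragraphs;
    }

    /** Hypothetical helper: builds a ZIP holding only the main part (not a complete .docx). */
    public static byte[] tinyDocx(String documentXml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(MAIN_PART));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample content: a date paragraph followed by a paragraph split into two runs.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>8 September 2012</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>entry.</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        List<String> paragraphs = readParagraphs(new ByteArrayInputStream(tinyDocx(xml)));
        paragraphs.forEach(System.out::println);
    }
}
```

Running main prints the two sample paragraphs, one per line. For a journal where each entry starts with a date, the same idea applies to the docx4j object model: walk the paragraphs of the main document part in order and start a new entry whenever a paragraph parses as a date.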

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow