I am using docx4j to read .docx files and need to get each whole paragraph of a document and replace strings in it

StackOverflow https://stackoverflow.com/questions/13199900

  •  29-07-2021

Question

I am using docx4j to read and parse .docx files, but when I iterate through the paragraphs, a single pass of the loop does not return the whole paragraph. Below is a sample of the code I am using.

private void replaceAcrAndDef(String acrName, String acrParensName, String oldDef, String newDef){
    // Note: "//w:t" selects individual text elements, not whole paragraphs
    String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
    List<Object> paragraphs = template.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
    for (Object obj : paragraphs){
        Text text = (Text) ((JAXBElement<?>) obj).getValue();
        String textValue = text.getValue();
        System.out.println(textValue);
    }
}

During one pass of the for loop above, this is printed as the first paragraph:

"Team has a deep understanding of the requirements by having direct MDA experience for the Mission, Test and Administrative and General Services networks and systems. The benefits to re a low risk, responsive Team with an established understanding of Mission, Processes and Priorities. Our use of an integrated based"

But the last parts of the paragraph are missing; they come out in the subsequent passes. What am I doing wrong here?

The entire contents of the paragraph are:

Team has a deep understanding of the requirements by having direct MDA experience for the Mission, Test and Administrative and General Services networks and systems. The benefits to are a low risk, responsive Team with an established understanding of Mission, Processes and Priorities. Our use of an integrated Information Technology based Role-Based Administration (RBA) approach works in synergy with associate contractors, existing processes and the addition of our complementary processes.

I do not know whether there is a way to get the entire paragraph, but if there is, that would be great, as I need to do string replacement on a paragraph-by-paragraph basis.


Solution

Expanding my comments into an answer:

I guess the paragraph contains more than one text element (w:t), so each pass of your loop prints only one fragment of it. Could you provide a sample document that shows this issue? What about extracting the text with TextUtils.extractText on the paragraph element?
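
To illustrate why selecting w:t yields fragments, here is a minimal sketch that uses only the JDK's DOM parser rather than docx4j. The class name ParagraphRuns, the SAMPLE markup and the helper methods are made up for this illustration; the sample assumes a single paragraph (w:p) whose text is split across three runs (w:r), which is how Word often stores text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ParagraphRuns {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: one paragraph whose text is split over three runs.
    static final String SAMPLE =
          "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
        + "<w:r><w:t xml:space=\"preserve\">Our use of an integrated </w:t></w:r>"
        + "<w:r><w:t xml:space=\"preserve\">Information Technology </w:t></w:r>"
        + "<w:r><w:t>based approach</w:t></w:r>"
        + "</w:p></w:body></w:document>";

    // Parses the XML string into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    // One entry per w:t element: this mirrors what selecting "//w:t" gives you.
    static List<String> textFragments(Document doc) {
        List<String> fragments = new ArrayList<>();
        NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            fragments.add(texts.item(i).getTextContent());
        }
        return fragments;
    }

    // One entry per w:p element: the w:t fragments inside each paragraph are concatenated.
    static List<String> paragraphTexts(Document doc) {
        List<String> paragraphs = new ArrayList<>();
        NodeList ps = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < ps.getLength(); i++) {
            NodeList texts = ((Element) ps.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                sb.append(texts.item(j).getTextContent());
            }
            paragraphs.add(sb.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(textFragments(doc));   // three separate fragments
        System.out.println(paragraphTexts(doc));  // one joined paragraph string
    }
}
```

Iterating over paragraphs and joining their text elements is, in spirit, what the paragraph-level extraction suggested in this answer does for you inside docx4j.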

Try P.toString(). Its implementation refers to TextUtils, which you can also call directly with a StringWriter.


Using P.toString():

// Request paragraphs
final String XPATH_TO_SELECT_TEXT_NODES = "//w:p";
final List<Object> jaxbNodes = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);

for (Object jaxbNode : jaxbNodes){
    final String paragraphString = jaxbNode.toString();
    System.out.println(paragraphString);
}

Using TextUtils.extractText(...) and StringWriter:

for (Object jaxbNode : jaxbNodes){
    final StringWriter stringWriter = new StringWriter();
    TextUtils.extractText(jaxbNode, stringWriter);
    final String paragraphString = stringWriter.toString();
    System.out.println(paragraphString);
}

Other tips

I use the following methods to perform search and replace with docx4j (inspired by http://www.smartjava.org/content/create-complex-word-docx-documents-programatically-docx4j):

public static List<Object> getAllElementFromObject(Object obj, Class<?> toSearch) {
    List<Object> result = new ArrayList<Object>();
    if (obj instanceof JAXBElement) obj = ((JAXBElement<?>) obj).getValue();

    if (obj.getClass().equals(toSearch))
        result.add(obj);
    else if (obj instanceof ContentAccessor) {
        List<?> children = ((ContentAccessor) obj).getContent();
        for (Object child : children) {
            result.addAll(getAllElementFromObject(child, toSearch));
        }
    }
    return result;
}

public static void findAndReplace(WordprocessingMLPackage doc, String toFind, String replacer){
    List<Object> paragraphs = getAllElementFromObject(doc.getMainDocumentPart(), P.class);
    for(Object par : paragraphs){
        P p = (P) par;
        List<Object> texts = getAllElementFromObject(p, Text.class);
        for(Object text : texts){
            Text t = (Text)text;
            if(t.getValue().contains(toFind)){
                t.setValue(t.getValue().replace(toFind, replacer));
            }
        }
    }
}
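
One caveat with the per-Text approach above: it only finds a match when the search string lies entirely inside a single text element. A possible workaround, sketched here under assumptions, is to join the texts of a paragraph, replace in the joined string, put the result into the first text element and empty the others. The sketch models the text elements as plain strings; CrossRunReplace and replaceAcrossRuns are hypothetical names, not docx4j API, and the run contents are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class CrossRunReplace {

    /**
     * Returns a list with the same number of entries as runTexts.
     * If the joined text contains toFind, the first entry holds the whole
     * replaced paragraph text and all other entries are empty strings;
     * otherwise the entries are returned unchanged.
     */
    static List<String> replaceAcrossRuns(List<String> runTexts, String toFind, String replacer) {
        String joined = String.join("", runTexts);
        if (!joined.contains(toFind)) {
            return new ArrayList<>(runTexts); // nothing to replace
        }
        String replaced = joined.replace(toFind, replacer);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < runTexts.size(); i++) {
            // Whole text goes into the first run; later runs are blanked out.
            result.add(i == 0 ? replaced : "");
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical paragraph whose search string is split across three runs.
        List<String> runs = List.of("Role-Based Admin", "istration (R", "BA)");
        System.out.println(replaceAcrossRuns(runs, "Administration (RBA)", "Access Control (RBAC)"));
    }
}
```

Applied to docx4j, each entry would presumably be written back with setValue on the corresponding Text object. The trade-off is that the whole paragraph then takes on the formatting of its first run, so this only suits paragraphs with uniform formatting.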

Hope this helps.

The XPath support in the Sun/Oracle JAXB implementation has a number of known flaws, which make it less useful in practice than it promises to be.

I don't use it. Instead I use something like:

    static class PFinder extends CallbackImpl {

        List<P> paragraphList = new ArrayList<P>();

        @Override
        public List<Object> apply(Object o) {
            // Collect every paragraph encountered during the traversal
            if (o instanceof P) {
                paragraphList.add((P) o);
            }
            return null;
        }
    }

    // The variable must be named pFinder (not PFinder) to match its use below
    PFinder pFinder = new PFinder();
    new TraversalUtil(paragraphs, pFinder);

    for (P p : pFinder.paragraphList) { ...

You could do something similar to look for w:t elements.

Or, if you really want to continue using XPath, you can now try MOXy.

More generally, I'd suggest you consider using content control data binding instead of your string replacement approach. In docx4j, content control data binding offers a range of advantages, including:

  • repeating material (eg rows of a table)
  • conditional inclusion/exclusion of content
  • inclusion of images (base64 encoded)
  • import of XHTML content

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow