Question

Rewritten to look more like a programming question

Okay, so I have done a little more research and it looks like the Java package I need to use is docx4j. Unfortunately, my lack of familiarity with the package, as well as with the underpinnings of the PDF format, makes it difficult for me to figure out exactly how to make use of the headers and footers returned by SectionWrapper.getHeaderFooterPolicy(). It's not entirely clear whether the HeaderPart and FooterPart objects returned are writeable, or how to modify them.

There is this code which offers an example of how to create a header part but it creates a new HeaderPart and adds it to the document.

I want to find existing header/footer parts and either remove them if possible or empty them out. Ideally they would be entirely gone from the document.

This code is similar and allows you to set the text of a HeaderPart using setJaxbElement, but so much of this terminology is unfamiliar, and I'm concerned the end result will be me creating headers (albeit empty ones) in each document rather than removing them.

Original Question Below

I am dealing with a set of wildly varying MS Word documents. I am compiling them into a single PDF and want to make sure that none of them have headers or footers before doing so.

Ideally, I'd also like to override their default font if it isn't Times New Roman.

Is there any way to do this programmatically or using some sort of batch process?

I will be running this on a Windows server that doesn't currently have Office or Word installed (although I think it might have an install of OpenOffice, and of course it's easy to just add an install as well).

Right now I'm using some version of iText (Java) to convert the files to PDF. I know that apparently iText can't do things like removing headers/footers, but since the underlying structure of modern Word (.docx) files is XML, I'm wondering if there is an API (or even an XML parsing/editing API or, if all else fails, a RegEx [horrors]) for removing the headers and footers and setting some default styles.


Solution

Here is some code hot off the press to do what you want:

public class HeaderFooterRemove {

    public static void main(String[] args) throws Exception {

        // A docx file, or a directory containing docx files
        String inputpath = System.getProperty("user.dir") + "/testHF.docx";

        // Collects the names of the files that were processed
        StringBuilder sb = new StringBuilder();

        File dir = new File(inputpath);

        if (dir.isDirectory()) {
            String[] files = dir.list();
            for (int i = 0; i < files.length; i++) {
                if (files[i].endsWith("docx")) {
                    sb.append("\n\n" + files[i] + "\n");
                    removeHFFromFile(new File(inputpath + "/" + files[i]));
                }
            }
        } else if (inputpath.endsWith("docx")) {
            sb.append("\n\n" + inputpath + "\n");
            removeHFFromFile(new File(inputpath));
        }

        System.out.println(sb.toString());
    }

    public static void removeHFFromFile(File f) throws Exception {

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(f);
        MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();

        // Remove the header/footer references from every sectPr
        SectPrFinder finder = new SectPrFinder(mdp);
        new TraversalUtil(mdp.getContent(), finder);
        for (SectPr sectPr : finder.getSectPrList()) {
            sectPr.getEGHdrFtrReferences().clear();
        }

        // Collect the header/footer relationships first, then remove them,
        // so the list is not modified while it is being iterated
        List<Relationship> hfRels = new ArrayList<Relationship>();
        for (Relationship rel : mdp.getRelationshipsPart().getRelationships().getRelationship()) {
            if (rel.getType().equals(Namespaces.HEADER)
                    || rel.getType().equals(Namespaces.FOOTER)) {
                hfRels.add(rel);
            }
        }
        for (Relationship rel : hfRels) {
            mdp.getRelationshipsPart().removeRelationship(rel);
        }

        // Overwrite the original file with the modified package
        wordMLPackage.save(f);
    }
}

The above code relies on SectPrFinder, so copy that somewhere.

I've left the imports out for brevity, but you can copy those from GitHub.

When it comes to making the set of docx into a single PDF, obviously you can either merge them into a single docx, then convert that to PDF, or convert them all to PDF, then merge those PDFs. If you prefer the former approach (for example, because end-users want to be able to edit the package of documents), then you may wish to consider our commercial extension for docx4j, MergeDocx.

Other tips

To remove the header/footer, there is a quite easy solution:

Open the docx as a zip archive, and remove the files named header*.xml and footer*.xml (located in the word folder).

Structure of an unzipped docx: https://stackoverflow.com/tags/docx/info

You also need to remove the links to those files (if you don't, the document will probably be corrupted):

Edit the document.xml.rels file and remove every Relationship element that refers to a header or footer. This is an example of a relationship that you should remove:

<Relationship Id="rId13" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer2.xml"/>

More generally, remove every Relationship whose Type ends in /footer or /header.
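The zip-based approach can be sketched with nothing but java.util.zip. This is a minimal sketch under several assumptions, not a definitive implementation: the class name DocxHeaderFooterStripper and its methods are hypothetical, the main part is assumed to live at word/document.xml, the XML is assumed to be UTF-8 with the usual w: prefix, and the edits are done with simple regular expressions rather than a real XML parser. As a precaution it also strips the headerReference/footerReference elements from document.xml and the matching Override entries from [Content_Types].xml, which the steps above do not mention. It needs Java 9 or later for readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxHeaderFooterStripper {

    // Self-closing elements that point at header/footer parts.
    // Assumption: the document uses the conventional "w:" prefix.
    private static final String SECT_REFS =
            "<w:(header|footer)Reference\\b[^>]*/>";
    private static final String HF_RELS =
            "<Relationship\\b[^>]*Type=\"[^\"]*/(header|footer)\"[^>]*/>";
    private static final String HF_OVERRIDES =
            "<Override\\b[^>]*PartName=\"/word/(header|footer)\\d*\\.xml\"[^>]*/>";

    /** Returns a copy of the docx with header/footer parts and their references removed. */
    public static byte[] strip(byte[] docx) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                String name = entry.getName();
                // Skip the header/footer parts and their own .rels files
                if (name.matches("word/(_rels/)?(header|footer)\\d*\\.xml(\\.rels)?")) {
                    continue;
                }
                byte[] data = zin.readAllBytes(); // reads to the end of the current entry
                String pattern = patternFor(name);
                if (pattern != null) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = xml.replaceAll(pattern, "").getBytes(StandardCharsets.UTF_8);
                }
                zout.putNextEntry(new ZipEntry(name));
                zout.write(data);
                zout.closeEntry();
            }
        }
        return out.toByteArray();
    }

    // Chooses which references to strip from a given part; null means "copy unchanged".
    private static String patternFor(String name) {
        switch (name) {
            case "word/document.xml":            return SECT_REFS;
            case "word/_rels/document.xml.rels": return HF_RELS;
            case "[Content_Types].xml":          return HF_OVERRIDES;
            default:                             return null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: DocxHeaderFooterStripper <input.docx> <output.docx>");
            return;
        }
        byte[] stripped = strip(Files.readAllBytes(Paths.get(args[0])));
        Files.write(Paths.get(args[1]), stripped);
    }
}
```

Writing to a separate output file keeps the original untouched, which makes it easy to open the result in Word and confirm it is not reported as corrupted before processing a whole batch.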

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow