Question

I'm trying to extract all text except the watermark text from PDF files with the Apache PDFBox library, so I want to remove the watermark first; the rest is what I want. Unfortunately, neither PDMetadata nor PDXObject can recognize the watermark. Any help will be appreciated. I found the code below.

    // Open PDF document
    PDDocument document = null;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Get all pages and loop through them
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while( iter.hasNext() ) {
        PDPage page = (PDPage)iter.next();
        PDResources resources = page.getResources();            
        Map images = null;
        // Get all Images on page
        try {
            images = resources.getImages();//How to specify watermark instead of images??
        } catch (IOException e) {
            e.printStackTrace();
        }
        if( images != null ) {
            // Check all images for metadata
            Iterator imageIter = images.keySet().iterator();
            while( imageIter.hasNext() ) {
                String key = (String)imageIter.next();
                PDXObjectImage image = (PDXObjectImage)images.get( key );
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: Analyzing for Metadata");
                if (metadata == null) {
                    System.out.println("No Metadata found for this image.");
                } else {
                    InputStream xmlInputStream = null;
                    try {
                        xmlInputStream = metadata.createInputStream();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                    try {
                        System.out.println("--------------------------------------------------------------------------------");
                        String mystring = convertStreamToString(xmlInputStream);
                        System.out.println(mystring);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the images
                String name = getUniqueFileName( key, image.getSuffix() );
                System.out.println( "Writing image:" + name );
                try {
                    image.write2file( name );
                } catch (IOException e) {
                    // Report the failure instead of silently swallowing it
                    e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    }

Solution

Contrary to your assumption, a PDF contains no explicit watermark object that would let you recognize watermarks in generic PDFs.

Watermarks can be applied to a PDF page in many ways; each PDF-creating library or application has its own way of adding watermarks, and some even offer several.

Watermarks can be

  1. anything (bitmap graphics, vector graphics, text, ...) drawn early in the content and, therefore, forming a background on which the rest of the content is drawn;
  2. anything (bitmap graphics, vector graphics, text, ...) drawn late in the content with transparency, forming a transparent overlay;
  3. anything (bitmap graphics, vector graphics, text, ...) drawn in the content stream of a watermark annotation, which, according to the specification, shall be used to represent graphics that shall be printed at a fixed size and position on a page, regardless of the dimensions of the printed page (cf. section 12.5.6.22 of the PDF specification ISO 32000-1).

Sometimes even mixed forms are used. Have a look at this answer for an example: at the bottom you find a 'watermark' drawn above graphics but beneath text (to allow for easy reading).

The last option (the watermark annotation) is obviously easy to remove, but it is also the least often used, most likely because it is so easy to remove; people applying watermarks generally don't want them to get lost. Furthermore, annotations are sometimes handled incorrectly by PDF viewers, and code that copies page content often ignores annotations.

If, on the other hand, you are not handling generic documents but one specific type of document (all generated alike), the way the watermarks are applied in them can probably be recognized, and an extraction routine might be feasible. If you have such a use case, please share a sample PDF for inspection.
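
To illustrate what such a document-specific routine could look like in the simplest case, here is a minimal sketch. It assumes (and this is only an assumption about your documents) that the watermark is a constant, known string and that it ends up on a line of its own in the text you extracted, for example with PDFBox's PDFTextStripper. The class WatermarkTextFilter and the method stripWatermarkLines are hypothetical names made up for this example; they are not part of PDFBox. The sketch works on the extracted text only and does not touch the PDF itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): filters a known watermark out of extracted text. */
final class WatermarkTextFilter {

    private WatermarkTextFilter() {}

    /**
     * Drops every line that, after trimming, equals the given watermark string.
     * Assumption: the watermark is constant and sits on its own line in the extracted text.
     */
    static String stripWatermarkLines(String extractedText, String watermark) {
        String target = watermark.trim();
        List<String> kept = new ArrayList<>();
        // Split on any line terminator; the -1 limit keeps trailing empty lines.
        for (String line : extractedText.split("\\R", -1)) {
            if (!line.trim().equals(target)) {
                kept.add(line);
            }
        }
        // Kept lines are re-joined with "\n", whatever terminator the input used.
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a page; the content is made up.
        String extracted = "Chapter 1\nCONFIDENTIAL\nThe quick brown fox.\n  CONFIDENTIAL  \nThe end.";
        System.out.println(stripWatermarkLines(extracted, "CONFIDENTIAL"));
    }
}
```

If the watermark is rotated, split into single glyphs, or interleaved with the regular text, a line-based filter like this will not be enough, and the page content itself has to be inspected.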

Licensed under: CC-BY-SA with attribution