How do I know if PDF pages are color or black-and-white?

https://stackoverflow.com/questions/641427

22-07-2019
|

Question

Given a set of PDF files among which some pages are color and the remaining are black & white, is there any program to find out among the given pages which are color and which are black & white? This would be useful, for instance, in printing out a thesis, and only spending extra to print the color pages. Bonus points for someone who takes into account double sided printing, and sends an appropriate black and white page to the color printer if it is are followed by a color page on the opposite side.

Solution

This is one of the most interesting questions I've seen! I agree with some of the other posts that rendering to a bitmap and then analyzing the bitmap will be the most reliable solution. For simple PDFs, here's a faster but less complete approach.

Parse each PDF page
Look for color directives (g, rg, k, sc, scn, etc)
Look for embedded images, analyze for color

My solution below does #1 and half of #2. The other half of #2 would be to follow up with user-defined color, which involves looking up the /ColorSpace entries in the page and decoding them -- contact me offline if this is interesting to you, as it's very doable but not in 5 minutes.

First the main program:

use CAM::PDF;

my $infile = shift;
my $pdf = CAM::PDF->new($infile);
PAGE:
for my $p (1 .. $pdf->numPages) {
   my $tree = $pdf->getPageContentTree($p);
   if (!$tree) {
      print "Failed to parse page $p\n";
      next PAGE;
   }
   my $colors = $tree->traverse('My::Renderer::FindColors')->{colors};
   my $uncertain = 0;
   for my $color (@{$colors}) {
      my ($name, @rest) = @{$color};
      if ($name eq 'g') {
      } elsif ($name eq 'rgb') {
         my ($r, $g, $b) = @rest;
         if ($r != $g || $r != $b) {
            print "Page $p is color\n";
            next PAGE;
         }
      } elsif ($name eq 'cmyk') {
         my ($c, $m, $y, $k) = @rest;
         if ($c != 0 || $m != 0 || $y != 0) {
            print "Page $p is color\n";
            next PAGE;
         }
      } else {
         $uncertain = $name;
      }
   }
   if ($uncertain) {
      print "Page $p has user-defined color ($uncertain), needs more investigation\n";
   } else {
      print "Page $p is grayscale\n";
   }
}

And then here's the helper renderer that handles color directives on each page:

package My::Renderer::FindColors;

sub new {
   my $pkg = shift;
   return bless { colors => [] }, $pkg;
}
sub clone {
   my $self = shift;
   my $pkg = ref $self;
   return bless { colors => $self->{colors}, cs => $self->{cs}, CS => $self->{CS} }, $pkg;
}
sub rg {
   my ($self, $r, $g, $b) = @_;
   push @{$self->{colors}}, ['rgb', $r, $g, $b];
}
sub g {
   my ($self, $gray) = @_;
   push @{$self->{colors}}, ['rgb', $gray, $gray, $gray];
}
sub k {
   my ($self, $c, $m, $y, $k) = @_;
   push @{$self->{colors}}, ['cmyk', $c, $m, $y, $k];
}
sub cs {
   my ($self, $name) = @_;
   $self->{cs} = $name;
}
sub cs {
   my ($self, $name) = @_;
   $self->{CS} = $name;
}
sub _sc {
   my ($self, $cs, @rest) = @_;
   return if !$cs; # syntax error                                                                                             
   if ($cs eq 'DeviceRGB') { $self->rg(@rest); }
   elsif ($cs eq 'DeviceGray') { $self->g(@rest); }
   elsif ($cs eq 'DeviceCMYK') { $self->k(@rest); }
   else { push @{$self->{colors}}, [$cs, @rest]; }
}
sub sc {
   my ($self, @rest) = @_;
   $self->_sc($self->{cs}, @rest);
}
sub SC {
   my ($self, @rest) = @_;
   $self->_sc($self->{CS}, @rest);
}
sub scn { sc(@_); }
sub SCN { SC(@_); }
sub RG { rg(@_); }
sub G { g(@_); }
sub K { k(@_); }

OTHER TIPS

Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not for each image) in Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0%, and 1.00000 means 100% (see Detecting all pages which contain color).

For example:

$ gs -q -o - -sDEVICE=inkcov file.pdf 
0.11264  0.11605  0.11605  0.09364 CMYK OK
0.11260  0.11601  0.11601  0.09360 CMYK OK

If the CMY values are not 0 then the page is color.

To just output the pages that contain colors use this handy oneliner:

$ gs -o - -sDEVICE=inkcov file.pdf |tail -n +4 |sed '/^Page*/N;s/\n//'|sed -E '/Page [0-9]+ 0.00000  0.00000  0.00000  / d'

It is possible to use the Image Magick tool identify. If used on PDF pages it converts the page first to a raster image. If the page contained color can be tested using the -format "%[colorspace]" option, which for my PDF printed either Gray or RGB. IMHO identify (or what ever tool it uses in the background; Ghostscript?) does choose the colorspace depending on the presents of color.

An example is:

identify -format "%[colorspace]" $FILE.pdf[$PAGE]

where PAGE is the page starting from 0, not 1. If the page selection is not used all pages will be collapsed to one, which is not what you want.

I wrote the following BASH script which uses pdfinfo to get the number of pages and then loops over them. Outputting the pages which are in color. I also added a feature for double sided document where you might need a non-colored backside page as well.

Using the outputted space separated list the colored PDF pages can be extracted using pdftk:

pdftk $FILE cat $PAGELIST output color_${FILE}.pdf

#!/bin/bash

FILE=$1
PAGES=$(pdfinfo ${FILE} | grep 'Pages:' | sed 's/Pages:\s*//')

GRAYPAGES=""
COLORPAGES=""
DOUBLECOLORPAGES=""

echo "Pages: $PAGES"
N=1
while (test "$N" -le "$PAGES")
do
    COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
    echo "$N: $COLORSPACE"
    if [[ $COLORSPACE == "Gray" ]]
    then
        GRAYPAGES="$GRAYPAGES $N"
    else
        COLORPAGES="$COLORPAGES $N"
        # For double sided documents also list the page on the other side of the sheet:
        if [[ $((N%2)) -eq 1 ]]
        then
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $N $((N+1))"
            #N=$((N+1))
        else
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $((N-1)) $N"
        fi
    fi
    N=$((N+1))
done

echo $DOUBLECOLORPAGES
echo $COLORPAGES
echo $GRAYPAGES
#pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf

The script from Martin Scharrer is great. It contains a minor bug: It counts two pages which contain color and are directly consecutive twice. I fixed that. In addition the script now counts the pages and lists the grayscale pages for double-paged printing. Also it prints the pages comma separated, so the output can directly be used for printing from a PDF viewer. I've added the code, but you can download it here, too.

Cheers, timeshift

#!/bin/bash

if [ $# -ne 1 ] 
then
    echo "USAGE: This script needs exactly one paramter: the path to the PDF"
    kill -SIGINT $$
fi

FILE=$1
PAGES=$(pdfinfo ${FILE} | grep 'Pages:' | sed 's/Pages:\s*//')

GRAYPAGES=""
COLORPAGES=""
DOUBLECOLORPAGES=""
DOUBLEGRAYPAGES=""
OLDGP=""
DOUBLEPAGE=0
DPGC=0
DPCC=0
SPGC=0
SPCC=0

echo "Pages: $PAGES"
N=1
while (test "$N" -le "$PAGES")
do
    COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
    echo "$N: $COLORSPACE"
    if [[ $DOUBLEPAGE -eq -1 ]]
    then
    DOUBLEGRAYPAGES="$OLDGP"
    DPGC=$((DPGC-1))
    DOUBLEPAGE=0
    fi
    if [[ $COLORSPACE == "Gray" ]]
    then
        GRAYPAGES="$GRAYPAGES,$N"
    SPGC=$((SPGC+1))
    if [[ $DOUBLEPAGE -eq 0 ]]
    then
        OLDGP="$DOUBLEGRAYPAGES"
        DOUBLEGRAYPAGES="$DOUBLEGRAYPAGES,$N"
        DPGC=$((DPGC+1))
    else 
        DOUBLEPAGE=0
    fi
    else
        COLORPAGES="$COLORPAGES,$N"
    SPCC=$((SPCC+1))
        # For double sided documents also list the page on the other side of the sheet:
        if [[ $((N%2)) -eq 1 ]]
        then
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES,$N,$((N+1))"
        DOUBLEPAGE=$((N+1))
        DPCC=$((DPCC+2))
            #N=$((N+1))
        else
        if [[ $DOUBLEPAGE -eq 0 ]]
        then 
                DOUBLECOLORPAGES="$DOUBLECOLORPAGES,$((N-1)),$N"
        DPCC=$((DPCC+2))
        DOUBLEPAGE=-1
        elif [[ $DOUBLEPAGE -gt 0 ]]
        then
        DOUBLEPAGE=0            
        fi                      
        fi
    fi
    N=$((N+1))
done

echo " "
echo "Double-paged printing:"
echo "  Color($DPCC): ${DOUBLECOLORPAGES:1:${#DOUBLECOLORPAGES}-1}"
echo "  Gray($DPGC): ${DOUBLEGRAYPAGES:1:${#DOUBLEGRAYPAGES}-1}"
echo " "
echo "Single-paged printing:"
echo "  Color($SPCC): ${COLORPAGES:1:${#COLORPAGES}-1}"
echo "  Gray($SPGC): ${GRAYPAGES:1:${#GRAYPAGES}-1}"
#pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf

ImageMagick has some built-in methods for image comparison.

http://www.imagemagick.org/Usage/compare/#type_general

There are some Perl APIs for ImageMagick, so maybe if you cleverly combine these with a PDF to Image converter you can find a way to do your black & white test.

I would try to do it like that, although there might be other easier solutions, and I'm curious to hear them, I just want to give it try:

Loop through all pages
Extract the pages to an image
Verify the color range of the image

For the page count, you can probably translate that without too much effort to Perl. It's basically a regex. It's also said that:

r"(/Type)\s?(/Page)[/>\s]"

You simply have to count how many times this regular expression occurs in the PDF file, minus the times you find the string "<>" (empty ages which are not rendered).

To extract the image, you can use ImageMagick to do that. Or see this question.

Finally, to get whether it is black and white, it depends if you mean literally black and white or grayscale. For black and white, you should only have, well, black and white in all the image. If you want to see grayscale, now, it's really not my speciality but I guess you could see if the averages of the red, the green and the blue are close to each other or if the original image and a grayscale converted one are close to each other.

Hope it gives some hints to help you go further.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow