Question

I am trying to run the Java code written by Stefano Chizzolini (awesome guy, creator of PDF Clown) to parse a PDF using the PDF Clown library. I am getting this error and I don't know what I can do to fix it.

Exception in thread "main" org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
at org.pdfclown.documents.contents.fonts.OpenFontParser.getName(OpenFontParser.java:570)
at org.pdfclown.documents.contents.fonts.OpenFontParser.load(OpenFontParser.java:221)
at org.pdfclown.documents.contents.fonts.OpenFontParser.<init>(OpenFontParser.java:205)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:626)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
at PDFReader.FullExtract.run(FullExtract.java:71)
at PDFReader.FullExtract.main(FullExtract.java:142)

I know the class OpenFontParser in the library package is throwing this error. Is there anything I can do to fix this?

This code works for most PDFs, but I have one PDF that it does not parse. I am guessing it is because of this symbol below in the PDF.

public class PDFReader extends Sample {

@Override
public void run()
{
    String filePath = new String("C:\\Users\\XYZ\\Desktop\\SomeSamplePDF.pdf");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // 2. Get the PDF document!
    Document document = file.getDocument();

    // 3. Extracting text from the document pages...
    for(Page page : document.getPages())
    {
        extract(new ContentScanner(page)); // Wraps the page contents into a scanner.
    }
    close(file);
}

private void close(File file) {
    // TODO Auto-generated method stub

}

/**
Scans a content level looking for text.
 */
/*
NOTE: Page contents are represented by a sequence of content objects,
possibly nested into multiple levels.
 */
private void extract(
        ContentScanner level
        )
{
    if(level == null)
        return;

    while(level.moveNext())
    {
        ContentObject content = level.getCurrent();
        if(content instanceof ShowText)
        {
            Font font = level.getState().getFont();
            // Extract the current text chunk, decoding it!
            System.out.println(font.decode(((ShowText)content).getText()));
        }
        else if(content instanceof Text
                || content instanceof ContainerObject)
        {
            // Scan the inner level!
            extract(level.getChildLevel());
        }
    }
}

private boolean prompt(Page page)
{
    int pageIndex = page.getIndex();
    if(pageIndex > 0)
    {
        Map<String,String> options = new HashMap<String,String>();
        options.put("", "Scan next page");
        options.put("Q", "End scanning");
        if(!promptChoice(options).equals(""))
            return false;
    }

    System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
    return true;
}

public static void main(String args[])
{
    new PDFReader().run();
}

}

Solution

The issue

As the stacktrace indicates, the problem is that some TrueType font embedded in the PDF does not contain a name table even though it is a required table:

org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
...
at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)

Thus, strictly speaking, that embedded font is invalid and, consequently, so is the PDF embedding it. PDFClown throws an exception because of this validity issue.

Some background

A TrueType font file consists of a sequence of concatenated tables. ...

The first of the tables is the font directory, a special table that facilitates access to the other tables in the font. The directory is followed by a sequence of tables containing the font data. These tables can appear in any order. Certain tables are required for all fonts. Others are optional depending upon the functionality expected of a particular font.

Tables that are required must appear in any valid TrueType font file. The required tables and their tag names are shown in Table 2.

Table 2: The required tables

Tag     Table 
'cmap'  character to glyph mapping 
'glyf'  glyph data 
'head'  font header 
'hhea'  horizontal header 
'hmtx'  horizontal metrics 
'loca'  index to location 
'maxp'  maximum profile 
'name'  naming 
'post'  PostScript 

(Section TrueType Font files: an overview in chapter 6 The TrueType Font File in the TrueType Reference Manual)
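To make the quoted structure concrete, here is a minimal sketch that reads the font directory of a TrueType font program and reports which of the required tables from Table 2 are absent. It assumes the standard directory layout (a 12-byte header followed by 16-byte table records, each starting with a 4-byte tag). The class name `TtfTableCheck`, its method names, and the `fakeDirectory` helper, which fabricates a directory without any table data purely for demonstration, are hypothetical and not part of PDFClown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TtfTableCheck {
    // Required tables as listed in Table 2 of the TrueType Reference Manual.
    public static final List<String> REQUIRED = Arrays.asList(
        "cmap", "glyf", "head", "hhea", "hmtx", "loca", "maxp", "name", "post");

    /** Reads the table tags listed in the font directory at the start of the font data. */
    public static Set<String> readTableTags(byte[] fontData) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(fontData));
        in.readInt();                           // sfnt version
        int numTables = in.readUnsignedShort(); // number of table records
        in.skipBytes(6);                        // searchRange, entrySelector, rangeShift
        Set<String> tags = new LinkedHashSet<>();
        byte[] tag = new byte[4];
        for (int i = 0; i < numTables; i++) {
            in.readFully(tag);                  // 4-byte table tag, e.g. 'name'
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            in.skipBytes(12);                   // checksum, offset, length
        }
        return tags;
    }

    /** Returns the required tables that the font directory does not list. */
    public static List<String> missingRequiredTables(byte[] fontData) throws IOException {
        List<String> missing = new ArrayList<>(REQUIRED);
        missing.removeAll(readTableTags(fontData));
        return missing;
    }

    /** Demo helper (hypothetical): builds a font directory only, with zeroed records. */
    public static byte[] fakeDirectory(String... tags) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(0x00010000);               // sfnt version 1.0
        out.writeShort(tags.length);            // numTables
        out.writeShort(0);                      // searchRange (unused here)
        out.writeShort(0);                      // entrySelector (unused here)
        out.writeShort(0);                      // rangeShift (unused here)
        for (String t : tags) {
            out.write(t.getBytes(StandardCharsets.US_ASCII));
            out.writeInt(0);                    // checksum
            out.writeInt(0);                    // offset
            out.writeInt(0);                    // length
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Simulates a font stripped of its 'name' table, as in the failing PDF.
        byte[] stripped = fakeDirectory(
            "cmap", "glyf", "head", "hhea", "hmtx", "loca", "maxp", "post");
        System.out.println("Missing required tables: " + missingRequiredTables(stripped));
    }
}
```

To inspect a real font, the bytes of the embedded font program (extracted from the PDF by whatever means you have at hand) would be passed to `missingRequiredTables` instead of the fabricated directory.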

On the other hand, though, there are a number of PDF generators cutting down embedded TrueType fonts to the bare essentials required by PDF viewers (foremost Adobe Reader), and the name table does not seem to be strictly required.

Furthermore, the name table is used for only one purpose in PDFClown: to determine the name of the font in question, even though the font name could also be determined from the BaseFont entry of the associated font dictionary. In fact, the BaseFont entry is required by the PDF specification, while the PostScript name entry in the name table is optional according to the TTF manual.

Thus, using the BaseFont entry in the PDF font dictionary would be a better alternative to this name table access.

Fixing it

Is there anything I can do to fix this?

You can either fix the not entirely valid PDF by adding a name table to the embedded TTF in question, or you can patch PDFClown to ignore the missing table: in the class org.pdfclown.documents.contents.fonts.OpenFontParser, edit the method getName:

private String getName(
  int id
  ) throws EOFException, UnsupportedEncodingException
{
  // Naming Table ('name' table).
  Integer tableOffset = tableOffsets.get("name");
  if(tableOffset == null)
    throw new ParseException("'name' table does NOT exist.");

Replace that throw new ParseException("'name' table does NOT exist."); with return null;.
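As a self-contained illustration of the idea (not PDFClown's actual code), the following sketch mimics the patched lookup, which yields null when the name table is missing, and combines it with the BaseFont fallback suggested above. The class `FontNameFallback`, both method names, and the placeholder string returned for an existing table are assumptions made for this example; how PDFClown itself treats a null name is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

public class FontNameFallback {
    /** Mimics the patched lookup: a missing 'name' table yields null, not an exception. */
    public static String nameFromNameTable(Map<String, Integer> tableOffsets) {
        Integer tableOffset = tableOffsets.get("name");
        if (tableOffset == null)
            return null; // patched behavior: tolerate the missing table
        // Placeholder: a real parser would decode the name records at this offset.
        return "<name read at offset " + tableOffset + ">";
    }

    /** Prefers the name table entry, falling back to the PDF font dictionary's BaseFont. */
    public static String resolveFontName(Map<String, Integer> tableOffsets, String baseFont) {
        String name = nameFromNameTable(tableOffsets);
        return name != null ? name : baseFont;
    }

    public static void main(String[] args) {
        Map<String, Integer> tableOffsets = new HashMap<>();
        tableOffsets.put("cmap", 252); // font directory without a 'name' entry
        System.out.println(resolveFontName(tableOffsets, "ABCDEF+Arial"));
    }
}
```

With a font directory lacking the name table, the lookup no longer aborts parsing and the BaseFont value is used as the font name instead.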

PS

While the problem could be analyzed using merely the information given by the OP, the sample file provided by @akarshad in his now deleted answer gave more motivation to start the analysis at all.

Licensed under: CC-BY-SA with attribution