Question

I created a program to read and extract text from PDF files, but it throws this exception during execution:

java.io.IOException: Error: Expected a long type, actual='930[299'
    at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1669)
    at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
    at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:632)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1205)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1097)
    at PatentAdder.main(PatentAdder.java:60)

This is my code:

import java.awt.Rectangle;
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.commons.io.filefilter.WildcardFileFilter;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;


public class PatentAdder {

    /**
     * @param args
     */

    public static String patno,patit,patdate,patfilled,appno;
    private static int File;
    public static void main(String[] args) {
        // TODO Auto-generated method stub

        try {

int cnt=0;

        if( args.length == 1 )
        {
          //  usage();
        }
        else
        {
            PDDocument document = null;
            try
            {

                  File dataDir = new File("F:/patents/test/tittest/USP2002w17/06/378/pdfs");

                  File[] files = dataDir.listFiles();



                  //  String[] files = dataDir.list();
                 int count=0;
                // System.out.println ("Satrt1");

                 for (File file : files) {
                    //  System.out.println ("Satrt2");
                      File f = file;

                      if (!f.isDirectory()) {
                document = PDDocument.load(f.getAbsolutePath());
                if( document.isEncrypted() )
                {
                    try
                    {
                        document.decrypt( "" );
                    }
                    catch( InvalidPasswordException e )
                    {
                        System.err.println( "Error: Document is encrypted with a password." );
                        System.exit( 1 );
                    }
                } }
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition( true );
             //  Rectangle rectt = new Rectangle( 590, 108, 600, 100 ); // enlarge title
                Rectangle rectt = new Rectangle( 288, 60, 222, 40 );
            Rectangle rect = new Rectangle( 55, 108, 230, 600 ); //  US-Patent title h40
            //   Rectangle rect = new Rectangle( 108, 210, 480, 499 ); //full enlarge
               stripper.addRegion( "class1", rect );
                stripper.addRegion("class2", rectt);

                List allPages = document.getDocumentCatalog().getAllPages();
                PDPage firstPage = (PDPage)allPages.get( 0 );
                stripper.extractRegions( firstPage );



               String title = "(?s)\\(54\\)\\s*([\\w\\s,-]+)|(?s)\\[54\\]\\s*([\\w\\s,-]+)";
               String in ="((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\[\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Assignee:))|((?s)\\(\\d\\d\\)\\s+Inventor:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Inventor\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Assignee:))";
               String as ="((?s)\\(\\d\\d\\)\\s+Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=\\(\\d*\\)\\s+Notice:))|((?s)\\(\\d\\d\\)\\s+Assignee:\\s*([\\-\\w\\d\\s,\\.\\(\\)-]+)*[\\w\\']*(?=\\n))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+);([\\w\\s.\\',();-]+)(?=Notice:))|(Assignee\\w*:\\s*\\w*([\\w\\d,.\\s)(-]+)(?=Notice:))";
               String app_no ="(?s)\\(21\\)\\s*([\\w\\s,.://-]+)|(?s)\\[21\\]\\s*([\\w\\s,.://-]+)";
               String filed ="((?s)\\(22\\)\\s*([\\w\\s,.://-]+))|((?s)\\(22\\)\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+))|((?s)\\[22\\]\\s*([\\w\\s,.://-]+)(?=\\s*\\n\\s*Related))";
               String term ="((?s)\\s*Term\\s*([\\w\\s,.://-]+))|((?s)\\s*Term\\s*([\\w\\s,.://-]+))";
               String pat_no = "(?s)\\s*Patent No\\.\\:\\s*([\\w\\d\\s,.://-]+)|(?s)\\s*Patent Number\\:\\s*([\\w\\d\\s,.://-]+)";
               String pat_dt = "(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventor:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\(45\\)\\s*Date([\\*\\w\\d\\s,.://-]+)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\[\\d*\\]\\s+Inventor:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)(?=\\(\\d*\\)\\s+Inventors:)|(?s)\\[45\\]\\s*Date([\\*\\w\\d\\s,.://-]+)";

                //   System.out.println(rg);
                String region = stripper.getTextForRegion( "class1" );
              //  System.out.println(region);

                String regiont = stripper.getTextForRegion( "class2" );
                Pattern p = Pattern.compile(in);
                Matcher m = p.matcher(region);

                Pattern p2 = Pattern.compile(as);
                Matcher m2 = p2.matcher(region);

                Pattern p3 = Pattern.compile(title);
                Matcher m3 = p3.matcher(region);

                Pattern p4 = Pattern.compile(pat_no);
                Matcher m4 = p4.matcher(regiont);

                Pattern p5 = Pattern.compile(app_no);
                Matcher m5 = p5.matcher(region);

                Pattern p6 = Pattern.compile(filed);
                Matcher m6 = p6.matcher(region);


                Pattern p7 = Pattern.compile(pat_dt);
                Matcher m7 = p7.matcher(regiont);


                while(m.find())
                 {
                   // System.out.println(m.group());


                 }

                while(m2.find())
                {
              //    System.out.println(m2.group());
                }

                while(m3.find())
                {
                 // System.out.println(m3.group());
                    patit = m3.group().replace("(54)", " ");
                    patit = patit.trim();
                }

                while(m4.find())
                {
                //  System.out.println(m4.group());
                    patno = m4.group().replace("Patent No.: ", " ");
                    patno = patno.replace("Patent No: ", " ");
                    patno = patno.replace("Patent", " ");
                    patno = patno.replace("No.:", " ");
                    patno = patno.replace("No:", " ");
                    patno = patno.replace("Number: ", " ");
                    patno = patno.replace("Number.: ", " ");
                    patno =  patno.trim();
                }

                while(m5.find())
                {
               //   System.out.println(m5.group());
                appno = m5.group().replace("(21)", " ");
                appno = appno.replace("Appl. No.: ", " ");
                appno = appno.replace("Appl.", " ");
                appno = appno.replace("No.", " ");
                appno = appno.replace(":"," ");
                appno = appno.trim();
                }

                while(m6.find())
                {
                 // System.out.println(m6.group());
                  patfilled = m6.group().replace("(22)", " ");
                  patfilled = patfilled.replace("Filed", " ");
                  patfilled= patfilled.replace("PCT", " ");
                  patfilled = patfilled.replace(":", " ");
                  patfilled = patfilled.replace("\n", "");
                  patfilled= patfilled.trim();
                }

                while (m7.find())
                {
                    patdate = m7.group().replace("(45) Date of Patent: ", " ");
                    patdate = patdate.replace("(45) Date of Patent.: ", " ");
                    patdate = patdate.replace("(45)", " ");
                    patdate = patdate.replace("Date", " ");
                    patdate = patdate.replace("of", " ");
                    patdate = patdate.replace("Patent.: ", " ");
                    patdate = patdate.replace("Patent: ", " ");
                    patdate = patdate.replace("Reissued", " ");
                    patdate = patdate.replace(":", " ");
                    patdate = patdate.replace("Patent", " ");
                    patdate = patdate.replace("*", " ");
                    patdate = patdate.trim();

                }
                System.out.println("File name:"+f.getName());
                System.out.println(patno +"\n"+patit+"\n"+patdate+"\n"+patfilled+"\n"+appno+"\n-------");

                //  boolean st = addPatent (patno,patit,patdate,patfilled,appno);

                //  if ( st == true ) System.out.println(patno+" added");
                //  else System.out.println(patno+" not added");
                  count++;
                  } 

                  System.out.print("-----Finised "+count+" Files------ \n");


            }
            finally
            {
                if( document != null )
                {
                    document.close();
                }
            }


        }

        }

        catch (Exception e)
        {
            System.out.println(e.getStackTrace());
            //System.out.println(e.getLocalizedMessage());
            System.out.println(e.getMessage());
            System.out.println(e.getCause());
            //System.out.println(e.getClass());
            e.printStackTrace();




        }

    }

    static boolean addPatent(String pno,String ptitle,String pat_date ,String filed_date , String appl_no )
    {
        int i=0;
        boolean status =false;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/patent", "root","ragesh"); 


            PreparedStatement st = con.prepareStatement("insert into patents_info values (?,?,?,?,?,?)");
            st.setString(1, pno);
            st.setString(2, ptitle);
            st.setString(3,pat_date);
            st.setString(4,filed_date);
            st.setString(5,appl_no);
            st.setInt(6,0);

            i=st.executeUpdate();

            if (i > 0) status= true;

        }
            catch (Exception e)
            {
                e.printStackTrace();
            }


    return status;

    }


    public static List<File> getAllChildFiles(File[] dir)
    {

        List<File> result = new ArrayList<File>();
        for (File file : dir)
        {
            if (file.isDirectory())
            {
                File[] children = file.listFiles();
                List<File> grandChildren = getAllChildFiles(children);
                result.addAll(grandChildren);
            }
            else
            {
                result.add(file);
            }
        }

        return result;
    }

    }

The program produces output for a number of files, but then halts and throws the exception shown above.

Sample output with the exception:

File name:06019327.pdf
Number:   6,019,327
[54] INSTALLATION STRUCTURE OF OUTDOOR   
COMMUNICATION DRIVE
[45]          Feb. 1, 2000
[22]      Aug. 30, 1996  
Related U.S. Application Data
[21]  08/704,920
-------
File name:06019328.pdf
Number:   6,019,328
[54] STAY-PUT PEGBOARD ACCESSORY
[45]          Feb. 1, 2000
[22]      Jan. 27, 1999
[21]  09/238,242
-------
File name:06019329.pdf
Number:   6,019,329
[54] CLAMPS
[45]          Feb. 1, 2000
[22]      Oct. 30, 1997
[21]  08/961,310
-------
File name:06019330.pdf
Number:   6,019,330
[54] ROOF GUARD DEVICE FOR LIFTING   
OBJECTS ON TO A ROOF
[45]          Feb. 1, 2000
[22]      Nov. 20, 1997
[21]  08/974,866
-------
File name:06019331.pdf
Number:   6,019,331
[54] CANTILEVER BRACKET ASSEMBLY
[45]          Feb. 1, 2000
[22]      May 28, 1997   
Related U.S. Application Data
[21]  08/865,587
-------
[Ljava.lang.StackTraceElement;@43a6684f
Error: Expected a long type, actual='930[299'

java.io.IOException: Error: Expected a long type, actual='930[299'
    at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1669)
    at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:100)
    at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:632)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1205)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1097)
    at PatentAdder.main(PatentAdder.java:60)

2nd Problem

Sometimes the execution freezes instead: after a few more iterations it just shows a blinking cursor. Why?

 File name:06019329.pdf
    Number:   6,019,329
    [54] CLAMPS
    [45]          Feb. 1, 2000
    [22]      Oct. 30, 1997
    [21]  08/961,310
    -------
    File name:06019330.pdf
    Number:   6,019,330
    [54] ROOF GUARD DEVICE FOR LIFTING   
    OBJECTS ON TO A ROOF
    [45]          Feb. 1, 2000
    [22]      Nov. 20, 1997
    [21]  08/974,866
    -------
    File name:06019331.pdf
    Number:   6,019,331
    [54] CANTILEVER BRACKET ASSEMBLY
    [45]          Feb. 1, 2000
    [22]      May 28, 1997   
    Related U.S. Application Data
    [21]  08/865,587
    -------


(the cursor keeps blinking and execution freezes)

Please help me resolve these two issues.

JDK version: 1.6, PDFBox version: 1.8.3


Solution

This is caused by PDFBox not following the PDF Reference to the letter :)

Tokens in a PDF token stream may be delimited by white space (as usual in most programming languages), but also implicitly: a token ends as soon as the next character is a delimiter, because a delimiter character always begins (or is) a token of its own. Therefore, it's totally valid, and certainly not unusual, to encounter constructions such as

/A[123/B(C)]

which is entirely equivalent to the slightly longer

/A [ 123 /B (C) ]

From ISO "PDF 32000-1:2008", 7.2.2 Character Set:

The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to all characters in the file except within strings, streams, and comments.

The White-space characters shown [...]

The delimiter characters (, ), <, >, [, ], {, }, /, and % are special [..]
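To make the effect of implicit delimiting concrete, here is a minimal toy tokenizer. It is only a sketch: the class and method names are made up for illustration, it is not PDFBox code, and it ignores comments, hex strings, `<<`/`>>` and nested or escaped parentheses. It is written for a current JDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer for illustration only (hypothetical names, not PDFBox code). */
final class PdfTokenSketch {

    // White-space characters from Table 1 of ISO 32000-1, 7.2.2.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Delimiter characters from Table 2 of ISO 32000-1, 7.2.2.
    static boolean isDelimiter(int c) {
        return c >= 0 && "()<>[]{}/%".indexOf(c) >= 0;
    }

    // Index of the first white-space or delimiter character at or after 'from'.
    private static int endOfRegularRun(String s, int from) {
        int j = from;
        while (j < s.length() && !isWhitespace(s.charAt(j)) && !isDelimiter(s.charAt(j))) {
            j++;
        }
        return j;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (isWhitespace(c)) {
                i++;                                  // white space only separates tokens
            } else if (c == '(') {
                // Literal string: simplification, runs to the first ')'.
                int end = s.indexOf(')', i);
                if (end < 0) {
                    end = s.length() - 1;
                }
                tokens.add(s.substring(i, end + 1));
                i = end + 1;
            } else if (c == '/') {
                // Name: the solidus plus the regular characters that follow it.
                int j = endOfRegularRun(s, i + 1);
                tokens.add(s.substring(i, j));
                i = j;
            } else if (isDelimiter(c)) {
                tokens.add(String.valueOf(c));        // e.g. '[' or ']' is a token by itself
                i++;
            } else {
                // Number or keyword: ends at white space OR at a delimiter.
                int j = endOfRegularRun(s, i);
                tokens.add(s.substring(i, j));
                i = j;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/A[123/B(C)]"));
        System.out.println(tokenize("/A [ 123 /B (C) ]"));
    }
}
```

Both calls in `main` produce the same six tokens, which is the equivalence claimed above.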

The code below shows the current implementation (taken from http://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java):

/**
 * This method is used to read a token by the {@linkplain #readInt()} method and the {@linkplain #readLong()} method.
 *
 * @return the token to parse as integer or long by the calling method.
 * @throws IOException throws by the {@link #pdfSource} methods.
 */
protected final StringBuilder readStringNumber() throws IOException
{
    int lastByte = 0;
    StringBuilder buffer = new StringBuilder();
    while( (lastByte = pdfSource.read() ) != 32 &&
            lastByte != 10 &&
            lastByte != 13 &&
            lastByte != 60 && //see sourceforge bug 1714707
            lastByte != 0 && //See sourceforge bug 853328
            lastByte != -1 )
    {
        buffer.append( (char)lastByte );
    }
    if( lastByte != -1 )
    {
        pdfSource.unread( lastByte );
    }
    return buffer;
}

The 'next character' is tested against some of the white-space characters from Table 1 in 7.2.2 (in code order: "Space", "Line Feed", "Carriage Return", and the Nul character), though the "Form Feed" code 0x0C and, very oddly, the common "Tab" 0x09 are still missing. The code does test, however, for end-of-file (the -1) and for < (60), the latter probably because someone ran into a similar bug before. (I could not locate the original bug report #1714707, but I infer it must have been similar to your issue.)

This list must be completed by adding the following characters, copied verbatim from Table 2 in 7.2.2:

Table 2 – Delimiter characters
Glyph   Decimal   Hexadecimal   Octal   Name
  (       40          28          50    LEFT PARENTHESIS
  )       41          29          51    RIGHT PARENTHESIS [1]
  <       60          3C          60    LESS-THAN SIGN
  >       62          3E          62    GREATER-THAN SIGN
  [       91          5B         133    LEFT SQUARE BRACKET
  ]       93          5D         135    RIGHT SQUARE BRACKET
  {      123          7B         173    LEFT CURLY BRACKET
  }      125          7D         175    RIGHT CURLY BRACKET
  /       47          2F          57    SOLIDUS
  %       37          25          45    PERCENT SIGN

The odd ones out are { and } since, currently, they only appear inside PostScript snippets, and those are not base objects but contained inside a stream. But perhaps they were historically "reserved for future expansion" (which should no longer be an issue, now that the PDF format has been frozen as an ISO specification).

Also, the character % in itself is a delimiter, but it needs some special handling as well as it introduces a comment:

The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line [...] (7.2.3 Comments)

(Note there is a little ambiguity there:

A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.

which should not be necessary, because the previous line already says the comment ends before the end-of-line; and so the end-of-line itself ought to remain in the input stream and thus act as a separator. Perhaps nothing more than a case of a belt-and-suspenders approach.)
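As a rough sketch of what a completed termination test could look like, the following standalone class reads a number token and stops at any white-space character from Table 1 or any delimiter from Table 2. This is not the actual PDFBox fix; the class, the method names, and the `firstNumberToken` convenience wrapper are assumptions made for the example, and comment handling for % is left out (the reader simply stops there). It is written for a current JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustrative stand-in for a delimiter-aware number reader (not PDFBox code). */
final class NumberTokenSketch {

    // White-space characters from Table 1 (includes Tab 0x09 and Form Feed 0x0C).
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Delimiter characters from Table 2.
    static boolean isDelimiter(int c) {
        return c >= 0 && "()<>[]{}/%".indexOf(c) >= 0;
    }

    /** Reads bytes until end-of-file, white space, or a delimiter; pushes the stop byte back. */
    static String readNumberToken(PushbackInputStream in) throws IOException {
        StringBuilder buffer = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && !isWhitespace(b) && !isDelimiter(b)) {
            buffer.append((char) b);
        }
        if (b != -1) {
            in.unread(b); // leave the terminating byte for the next token
        }
        return buffer.toString();
    }

    /** Hypothetical helper so the sketch can be tried on a plain string. */
    static String firstNumberToken(String pdfText) {
        byte[] bytes = pdfText.getBytes(StandardCharsets.ISO_8859_1);
        try (PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            return readNumberToken(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String token = firstNumberToken("930[299");
        // prints "930 parses as 930"
        System.out.println(token + " parses as " + Long.parseLong(token));
    }
}
```

With the input from the exception message, the reader stops at the `[` instead of swallowing it, so the token converts to a long without trouble.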


[1] On reviewing: actually, the closing parenthesis is redundant. It can only occur after a matching opening parenthesis, and that introduces a string. Viewed one token at a time, you should never encounter a stray ) -- if you do, that indicates a malformed PDF stream.

OTHER TIPS

The readLong method reads a long from the underlying stream. The PDFBox API documents that it throws an IOException; in your stack trace the exception is raised in readLong itself, because the token it read from the input source (pdfSource) cannot be parsed as a long.

In your case the log is pretty self-explanatory: there is a square bracket '[' in the token ('930[299'), which makes the conversion to long impossible.

You have two options:

  • check your input and your parser logic (or perform a sanity check before calling PDDocument.load)
  • narrow the scope of your try/catch block to line 60 of your class, so that you can handle this specific IOException and react accordingly (if possible in your method logic)
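The second option could look roughly like the sketch below: the try/catch sits inside the loop, so one unreadable file is logged and skipped instead of aborting the whole batch. `FileHandler` and `processAll` are hypothetical names introduced for this example; in your program the handler body would contain the `PDDocument.load(f.getAbsolutePath())` call and the extraction, and should close each document in its own finally block. The code is written for a current JDK (it uses a lambda).

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch of per-file error handling; names are illustrative. */
final class BatchSketch {

    /** Hypothetical callback standing in for "load this PDF and extract its fields". */
    interface FileHandler {
        void handle(File file) throws IOException;
    }

    /** Runs the handler on every file and returns the files that failed with an IOException. */
    static List<File> processAll(List<File> files, FileHandler handler) {
        List<File> failed = new ArrayList<>();
        for (File f : files) {
            try {
                handler.handle(f);
            } catch (IOException e) {
                // Only this file is affected; the loop continues with the next one.
                System.err.println("Skipping " + f.getName() + ": " + e.getMessage());
                failed.add(f);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<File> files = new ArrayList<>();
        files.add(new File("good.pdf"));
        files.add(new File("bad.pdf"));
        // The lambda simulates a parser that rejects one of the files.
        List<File> failed = processAll(files, f -> {
            if (f.getName().startsWith("bad")) {
                throw new IOException("Error: Expected a long type");
            }
        });
        System.out.println("Failed files: " + failed);
    }
}
```

Collecting the failed files also gives you a list of PDFs to inspect by hand afterwards.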

About the freeze issues

Are you sure the code is not stuck in one of your:

while(mX.find()) 
{ 
  ... 
} 

blocks? I find the design pretty error-prone, especially for X = 1 and 2. I have no time to go into the logic, but you may want to refactor the while condition as follows:

long TIMEOUT = 15000L; // 15 seconds
long now = System.currentTimeMillis(); // init the long just above the while

while(mX.find() && (System.currentTimeMillis() - now) < TIMEOUT)
{
   ...
}
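
One caveat with the loop above: the time check only runs between calls to find(). If a single find() call gets stuck in heavy backtracking (nested quantifiers of the form `(...+)*`, which appear in the Inventor and Assignee patterns of the question, are a common trigger), the condition is never re-evaluated. A commonly used workaround, sketched here with made-up class names and an arbitrary 200 ms limit, is to wrap the input in a CharSequence that checks a deadline on every character access. Treat it as an assumption-laden sketch for a current JDK, not as a guaranteed fix for your case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: abort a regex match once a deadline has passed (names are illustrative). */
final class RegexTimeoutSketch {

    /** CharSequence view that throws as soon as it is read after the deadline. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt, so this runs during matching.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Creates a matcher whose find()/matches() calls fail once timeoutMillis have elapsed. */
    static Matcher matcherWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline));
    }

    public static void main(String[] args) {
        // Nested quantifier, similar in shape to the ([...]+)* groups in the question.
        Pattern risky = Pattern.compile("([\\w\\s]+)*;");
        String input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; // no ';' so the match must fail
        try {
            boolean found = matcherWithTimeout(risky, input, 200).find();
            System.out.println("Finished in time, found = " + found);
        } catch (IllegalStateException e) {
            System.out.println("Gave up: " + e.getMessage());
        }
    }
}
```

Whether the pattern in `main` actually runs into the limit depends on the JDK version, but either way the call returns instead of hanging.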
Licensed under: CC-BY-SA with attribution