Question

I extracted text from a PDF document, and I want to extract some particular fields from it using Java.

A portion of the text:

US00RE44697E
(i9) United States
(12) Reissued Patent (10) Patent Number: RE44,697 E
Jones et al. (45) Date of ReissuedPatent: Jan. 7, 2014
(54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT
(75) Inventors: David E.Jones, Ottawa (CA); Cormac
M.O'Connell, Carp (CA)
(73) Assignee: Mosaid Technologies Incorporated,
Ottawa, Ontario (CA)
(21) Appl.No.: 13/603,137
(22) Filed: Sep. 4, 2012
Related U.S. Patent Documents
Reissue of:
(64) Patent No.:
Issued:
Appl. No.:
Filed:
6,088,800
Jul. 11, 2000
09/032,029
Feb. 27, 1998
(51) Int.CI.
G06F 21/00 (2013.01)
(52) U.S. CI.
USPC .............713/189; 713/190; 713/193; 380/28;
380/33; 380/52
(58) Field of Classification Search
None

Now my goal is to extract fields from it and assign them to strings, that is:

the text (10) Patent Number: RE44,697 E will be extracted as String pat_no= " RE44,697 E"

the text (54) ENCRYPTIONPROCESSORWITH SHARED MEMORY INTERCONNECT will be extracted as String title= "ENCRYPTIONPROCESSORWITH SHARED MEMORY INTERCONNECT"

the extremely irregular text block

(64) Patent No.:
Issued:
Appl. No.:
Filed:
6,088,800
Jul. 11, 2000
09/032,029
Feb. 27, 1998

has to be extracted as

String pat_no_org = "6,088,800";
String issued = "jul.11,2000";
String filed = "feb 27 ,1998";

and so on.

What I tried

First I used String.split, String.substring, String.indexOf and even Apache StringUtils, but none of them helped, because the text is scattered. I also tried regular expressions, but I am very weak at them and could not write a working pattern.

How can I achieve this using Java?

Solution

With regex, I would split the task into 3 parts:

1.) For (10) Patent Number, the regex could look like this:

\(10\)\s*Patent Number:\s*([\w,]+)

as a java string:

"\\(10\\)\\s*Patent Number:\\s*([\\w,]+)"

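The match for the first parenthesized group will be in [1] (in Java, matcher.group(1)). Note that [\w,]+ stops at the first space, so for the sample text it captures RE44,697 without the trailing E.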
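As a minimal sketch of how this first pattern could be applied with java.util.regex: the class name PatentNumberExtractor, the helper firstGroup and the second pattern PATENT_NUMBER_WITH_KIND are illustrative assumptions, not part of the answer. The second pattern is one possible way to also pick up a trailing kind code such as " E", which the question asks for.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatentNumberExtractor {

    // Pattern from the answer: word characters and commas after the label.
    static final Pattern PATENT_NUMBER =
            Pattern.compile("\\(10\\)\\s*Patent Number:\\s*([\\w,]+)");

    // Assumed variant (not from the answer): additionally accepts an optional
    // kind code, i.e. one capital letter plus an optional digit, on the same line.
    static final Pattern PATENT_NUMBER_WITH_KIND =
            Pattern.compile("\\(10\\)\\s*Patent Number:\\s*([\\w,]+(?:[ \\t]+[A-Z]\\d?\\b)?)");

    // Returns group 1 of the first match, or null when the field is absent.
    static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "(12) Reissued Patent (10) Patent Number: RE44,697 E\n"
                + "Jones et al. (45) Date of ReissuedPatent: Jan. 7, 2014";
        System.out.println(firstGroup(PATENT_NUMBER, text));           // RE44,697
        System.out.println(firstGroup(PATENT_NUMBER_WITH_KIND, text)); // RE44,697 E
    }
}
```

Using find() rather than matches() matters here, because the pattern describes only a small part of the extracted text and not the whole input.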


2.) (54) ENCRYPT...

A pattern could look like:

(?s)\(54\)\s*(.*?)\s*(?=\(\d|$)

as a java string:

"(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)"
  • (?s) The s modifier is equivalent to Pattern.DOTALL, so the dot matches newlines too.
  • (?=\(\d|$) A lookahead: (.*?) lazily matches any characters until either another ( followed by a digit, or the end of the string ($), is seen.
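A small sketch of the title extraction in Java follows. The lookahead is written as (?=\(\d|$) so that the end-of-input alternative can actually match. The class name TitleExtractor and the whitespace-collapsing step are assumptions added for illustration: because the title spans two lines in the extracted text, the captured group contains a line break that probably should become a single space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {

    // (?s) lets the dot cross line breaks; the lookahead stops the lazy match
    // at the next "(<digit>" field marker or at the end of the input.
    static final Pattern TITLE =
            Pattern.compile("(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)");

    // Returns the title with runs of whitespace collapsed, or null if not found.
    static String extractTitle(String text) {
        Matcher m = TITLE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String text = "(54) ENCRYPTIONPROCESSORWITH SHARED\n"
                + "MEMORY INTERCONNECT\n"
                + "(75) Inventors: David E.Jones, Ottawa (CA)";
        // prints: ENCRYPTIONPROCESSORWITH SHARED MEMORY INTERCONNECT
        System.out.println(extractTitle(text));
    }
}
```

One limitation of this approach: a title that itself contains an opening parenthesis followed by a digit would be cut off at that point.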

3.) For the other 3 desired parts, I would mirror the layout of the input in the pattern. This requires that every document lays this block out in the same way. A pattern could look like this:

(?s)\(64\).*?Filed:\s*([\d,]+)\s*(\w+\.\s*\d+,\s*\d+)\s*\n\d[^\n]*\n\s*(\w+\.\s*\d+,\s*\d+)

as a java string:

"(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)\\s*\\n\\d[^\\n]*\\n\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)"
  • \n matches a newline.
  • \d[^\n]* skips the line that starts with a digit (the original application number) without capturing it.

Matches will be in [1], e.g. 6,088,800, in [2], e.g. Jul. 11, 2000, and in [3], e.g. Feb. 27, 1998.
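The third pattern could be used along these lines. This is a sketch that assumes the block always has the layout shown in the question (four labels followed by four values, one per line); the class name ReissueBlockExtractor and the choice of returning a String array are illustrative assumptions. The line to be skipped is written here as \d[^\n]*, meaning a line that starts with a digit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReissueBlockExtractor {

    // Group 1: original patent number, group 2: issue date, group 3: filing date.
    // The application-number line between the two dates is matched but not captured.
    static final Pattern REISSUE = Pattern.compile(
            "(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)"
            + "\\s*\\n\\d[^\\n]*\\n\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)");

    // Returns {patentNumber, issued, filed}, or null if the block does not match.
    static String[] extract(String text) {
        Matcher m = REISSUE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String text = "(64) Patent No.:\nIssued:\nAppl. No.:\nFiled:\n"
                + "6,088,800\nJul. 11, 2000\n09/032,029\nFeb. 27, 1998\n"
                + "(51) Int.CI.";
        String[] fields = extract(text);
        System.out.println("pat_no_org = " + fields[0]); // 6,088,800
        System.out.println("issued     = " + fields[1]); // Jul. 11, 2000
        System.out.println("filed      = " + fields[2]); // Feb. 27, 1998
    }
}
```

Because the pattern encodes the exact order of the values, a document whose PDF extraction emits the lines in a different order will simply not match, so checking for the null return is worthwhile.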

For getting started with regex, this is too much information at once :)

Licensed under: CC-BY-SA with attribution