Information Extraction - business documents

https://stackoverflow.com/questions/17252118

01-06-2022

Question

I'm currently trying to extract information, e.g. the sender or recipient, from business documents such as bills. The documents were processed with OCR software into XML files, so they are annotated with formatting characteristics. After manually annotating one similar document with features like sender and recipient, I want to extract the same information from a new document.

So my question is whether there is a learning or matching algorithm that can extract specific data by comparing against only one or two examples of similar documents. If so, is there a Java framework capable of that?

Yours thankfully

maggu

Solution

If the XML structure is always the same (using the same template):

Just save the paths (the chain of XML parent nodes) of the nodes you selected in the annotated document, so you know where the information is located in every other document that uses the same template. That shouldn't be a problem; it is a trivial task.
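A minimal sketch of that idea in Java, using the standard javax.xml.xpath API. The class name, the field names and the element names in the paths (document, page, block, line) are assumptions for illustration only; real OCR output will use its own schema, and the paths would be recorded from your manually annotated document.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch: look up fields in a new OCR document using paths recorded
// from one manually annotated document of the same template.
class TemplatePathExtractor {

    // Field name -> XPath recorded from the annotated example (hypothetical paths).
    private final Map<String, String> fieldPaths = new LinkedHashMap<>();

    // Stores the path under which a field was found in the annotated document.
    void remember(String field, String xpath) {
        fieldPaths.put(field, xpath);
    }

    // Parses the XML text and evaluates every remembered path against it.
    // A path that matches nothing yields an empty string for that field.
    Map<String, String> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : fieldPaths.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match
                result.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not read document", ex);
        }
    }
}
```

Usage would look like calling remember("sender", "/document/page/block[1]/line[1]") once per annotated field and then extract(xml) for each new document. This only works as long as the OCR software produces the same structure for every document of the template.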

If you have to search for the information:

It could work by creating feature extraction rules and then using those features to train a Support Vector Machine to detect the areas where the information is located.

I once asked a similar question: "Algorithm to match natural text in mail".

But that is far from trivial, and definitely needs more than one or two training documents.
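To make the feature extraction step more concrete, here is a rough sketch of what such rules might look like in Java. Everything in it is an assumption for illustration: the Block class, the chosen features (relative position, bold flag, digit and upper-case ratios, a company-suffix rule) and the suffix list are hypothetical and not taken from any particular OCR format. The resulting double[] is the kind of vector you would hand to an SVM library together with a label such as "sender" or "other"; the training itself is not shown.

```java
import java.util.Locale;

// Sketch: turn one OCR text block into a numeric feature vector.
class BlockFeatures {

    // Hypothetical block description; real OCR XML will expose different attributes.
    static final class Block {
        final String text;
        final double x;      // left edge in page units
        final double y;      // top edge in page units
        final boolean bold;  // formatting flag from the OCR annotation

        Block(String text, double x, double y, boolean bold) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.bold = bold;
        }
    }

    // Hypothetical rule: company suffixes hint at a sender or recipient block.
    private static final String[] COMPANY_SUFFIXES = {"gmbh", "ltd", "inc"};

    // Returns {relX, relY, bold, digitRatio, upperRatio, hasCompanySuffix}.
    static double[] features(Block b, double pageWidth, double pageHeight) {
        String t = b.text;
        int digits = 0;
        int upper = 0;
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (Character.isDigit(c)) digits++;
            if (Character.isUpperCase(c)) upper++;
        }
        double len = Math.max(1, t.length()); // avoid division by zero for empty text

        String lower = t.toLowerCase(Locale.ROOT);
        double suffix = 0.0;
        for (String s : COMPANY_SUFFIXES) {
            if (lower.contains(s)) {
                suffix = 1.0;
                break;
            }
        }

        return new double[] {
            b.x / pageWidth,      // position relative to page size
            b.y / pageHeight,
            b.bold ? 1.0 : 0.0,
            digits / len,
            upper / len,
            suffix
        };
    }
}
```

Normalising the position by the page size keeps the features comparable across scans of different resolution. Which features are actually useful depends on the documents, and that is part of why this needs a larger training set.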

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow