Extracting information from millions of simple but inconsistent text files

https://stackoverflow.com/questions/5916901

29-10-2019

Question

We have millions of simple .txt documents, extracted from PDFs, that contain various data structures. The text is printed line by line, so all formatting is lost (the tools we tried for preserving the layout just mangled it). We need to extract the fields and their values from these documents, but the structure varies from file to file: an extra newline here and there, and noise on some sheets that leads to misspellings.

I was thinking we could create some sort of template that records the coordinates (line number, word position or positions) of each keyword and its value, then use that information to locate and collect the values, with tolerant matching algorithms to make up for the inconsistent formatting.
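To make the idea concrete, here is a rough sketch of the kind of tolerant keyword lookup I have in mind. It is only an illustration under assumptions: the field labels in `TEMPLATE_KEYWORDS`, the `extract_fields` helper, and the 0.8 similarity threshold are all made up for the example, and it uses Python's standard `difflib` for the fuzzy comparison rather than fixed line/word coordinates.

```python
import difflib

# Hypothetical template: the field labels expected in a document.
TEMPLATE_KEYWORDS = ["Invoice Number", "Customer Name", "Total Amount"]


def _similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_fields(text, keywords=TEMPLATE_KEYWORDS, threshold=0.8):
    """Return {keyword: value} for keywords found at the start of a line.

    A keyword matches when the leading words of a line are similar enough
    to it (tolerating misspellings). The value is the rest of that line,
    or the next non-empty line if the label stands alone.
    """
    # Dropping blank lines makes stray newlines harmless.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {}
    for i, line in enumerate(lines):
        words = line.split()
        for keyword in keywords:
            if keyword in fields:
                continue  # keep the first occurrence only
            n = len(keyword.split())
            candidate = " ".join(words[:n]).rstrip(":")
            if _similarity(candidate, keyword) < threshold:
                continue
            value = " ".join(words[n:]).lstrip(": ").strip()
            if not value and i + 1 < len(lines):
                value = lines[i + 1]  # value was pushed onto the next line
            fields[keyword] = value
            break
    return fields


sample = """Invoise Number: 12345

Customer Nane
ACME Ltd
Total Amount 99.50"""

print(extract_fields(sample))
```

In this sketch the misspelled labels ("Invoise", "Nane") still match their template keywords, and a value that has been pushed to the following line is picked up from there. It does not handle labels that appear mid-line or values spanning several lines.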

Is there a standard way of doing this? Any links that might help, or any other ideas?

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow