Extracting paragraph breaks from OCR text?

https://stackoverflow.com/questions/5925561

30-10-2019
|

Question

I'm trying to recreate the paragraphs and indentations from the output of OCR'd image text, like so:

Input (imagine that this is an image, not typed):

enter image description here

Output (with a few mistakes):

enter image description here

As you can see, no paragraph breaks or indentations are preserved.

Using Python, I tried an approach like this, but it doesn't work (fails too often):

Code:

def smart_format(text):
  textList = text.split('\n')
  temp = ''

  averageLL = sum([len(line) for line in textList]) / len(textList)

  for line in textList:
    if (line.strip().endswith('!') or line.strip().endswith('.') or line.strip().endswith('?')) and not line.strip().endswith('-'):
      if averageLL - len(line) > 7:
        temp += '{{ paragraph }}' + line + '\n'
      else:
        temp += line + '\n'
    else:
      temp += line + '\n'

  return temp.replace(' -\n', '').replace('-\n', '').replace(' \n', '').replace('\n', ' ').replace('{{ paragraph }}', '\n\n      ')

Does anyone have any suggestions as how I could recreate this layout? I'm working with old books, so I was hoping to re-typeset them with LaTeX, as it's quite simple to create a Python script to do that.

Thanks!

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow