Newline in text extraction from PDF

https://stackoverflow.com/questions/21622530

08-10-2022

Question

I am writing a function that extracts text from a PDF using the pyPdf library. Extraction works, but I have run into a couple of problems, such as newlines being dropped.

So I looked for a way to add the newlines back, and came up with this:

content = ""
pages = ""

# Iterate over the pages
for i in range(pdf.getNumPages()):
    # Extract the text of this page only, so earlier pages are not appended twice
    content = pdf.getPage(i).extractText()
    # Mark every '. ' as a sentence end
    content = content.replace('. ', '. <br />')
    pages += content

# Replace non-breaking spaces and collapse runs of whitespace
content = " ".join(pages.replace(u"\xa0", " ").strip().split())

The problem is that even a numbered item like this:

1. Apple

becomes this:

1.

Apple

That should not happen. I only want to add a newline at the end of each sentence.

Is there a way to determine where a sentence ends, or to check whether the full stop belongs to list numbering?

Solution

A hackish solution is to perform replacement only if the full stop is not immediately preceded by a digit. Change the line content = content.replace('. ', '. <br />') to the following:

import re

# re.sub returns a new string, so assign the result back
content = re.sub(r'([^0-9])\. ', r'\1. <br />', content)
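
To see how the digit guard behaves, here is a minimal sketch on a made-up sample string. The helper name mark_sentence_ends and the sample text are illustrative assumptions, not part of the original code.

```python
import re

def mark_sentence_ends(text):
    """Insert '<br />' after '. ' unless the full stop follows a digit."""
    # ([^0-9]) captures the non-digit character before the full stop
    # so that \1 can put it back in the replacement.
    return re.sub(r'([^0-9])\. ', r'\1. <br />', text)

# Hypothetical sample: two numbered items, each followed by a sentence
sample = "1. Apple is a fruit. 2. Pear is too. Done."
result = mark_sentence_ends(sample)
print(result)
```

Note the trade-off that makes this hackish: a sentence that happens to end in a digit, such as "It was 2022. Next", is also skipped.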

Other tips

Why not use re.sub()?

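For a line that ends with a dot, possibly followed by some whitespace, the pattern should be "\.\s*$", i.e.,

import re
# ...

# re.MULTILINE makes $ match at the end of every line, not only at the end of the string
content = re.sub(r'\.\s*$', '. <br />', content, flags=re.MULTILINE)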
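
One caveat worth illustrating: by default, $ matches only at the very end of the string (or just before a final trailing newline), so this pattern touches every line only when re.MULTILINE is passed. A small sketch on a made-up two-line string (the variable names are illustrative):

```python
import re

# Hypothetical input: two lines, the first with trailing spaces
text = "First sentence.  \nSecond sentence."

# Without flags, $ anchors only at the end of the whole string.
last_only = re.sub(r'\.\s*$', '. <br />', text)

# With re.MULTILINE, $ also anchors at the end of every line.
every_line = re.sub(r'\.\s*$', '. <br />', text, flags=re.MULTILINE)

print(repr(last_only))
print(repr(every_line))
```

This only helps if the extracted text still contains line breaks; if the extractor returns everything on one line, only the final full stop is affected.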

pyPdf is great for some things, but not really good at text extraction. Have a look at the pdfminer library, or use a tool like pdftotext.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow