Question
I am writing a function that extracts text from a PDF using the pyPdf library. The extraction itself works, but I am running into a couple of problems, such as newlines being dropped.
So I looked for a way to add the newlines back, and came up with this:
# Iterate pages
for i in range(0, pdf.getNumPages()):
    # Extract text from page and add to content
    content += pdf.getPage(i).extractText()
    content = content.replace('. ', '. <br />')
    pages += content
# Collapse whitespace
content = " ".join(pages.replace(u"\xa0", " ").strip().split())
The problem is that even numbered items like this:
1. Apple
end up like this:
1.
Apple
That shouldn't happen. I just want to add a newline at the end of every sentence.
Is there a way to determine where a sentence ends, or to check whether a full stop belongs to a list number?
Solution
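A hackish solution is to perform the replacement only if the full stop is not immediately preceded by a digit. Change the line
content = content.replace('. ', '. <br />')
to the following (with the import at the top of the file), and note that re.sub returns the new string, so the result has to be assigned back:
import re
content = re.sub(r'([^0-9])\. ', r'\1. <br />', content)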
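The same idea can also be written with a negative lookbehind, which avoids capturing and re-inserting the preceding character. This is only a minimal sketch: the function name add_sentence_breaks and the sample string are made up for illustration, and the heuristic will still skip a real sentence that happens to end in a digit (for example "in 2010. ").

```python
import re

def add_sentence_breaks(text):
    # Hypothetical helper: append "<br />" after ". " unless the
    # full stop directly follows a digit (as in "1. Apple").
    return re.sub(r'(?<!\d)\. ', '. <br />', text)

sample = "1. Apple is a fruit. It is red. 2. Pear is green."
print(add_sentence_breaks(sample))
# prints: 1. Apple is a fruit. <br />It is red. <br />2. Pear is green.
```

The final full stop is left alone because it is not followed by a space, which matches the behaviour of the original replace() call.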
Other tips
Why not use re.sub()?
For a line that ends with a dot, possibly followed by some whitespace, the pattern should be "\.\s*$", i.e.,
import re
:
content = re.sub(r'\.\s*$', '. <br />', content)
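One thing to keep in mind with this pattern: without the re.MULTILINE flag, $ only matches at the very end of the string (or just before a trailing newline), so only the last full stop would be affected. A rough sketch of a per-line variant follows. The function name and sample text are hypothetical, and it uses [ \t]* instead of \s* so that newlines themselves are not swallowed. It only helps if the extracted text still contains line breaks, which may not be the case with pyPdf.

```python
import re

def break_after_line_end_dots(text):
    # Hypothetical helper: with re.MULTILINE, "$" matches at the end of
    # every line, so each line ending in "." (plus optional spaces or
    # tabs) gets a "<br />" appended.
    return re.sub(r'\.[ \t]*$', '. <br />', text, flags=re.MULTILINE)

sample = "First sentence.  \n1. Apple\nLast one."
print(break_after_line_end_dots(sample))
```

Here "1. Apple" is left untouched because its full stop is not at the end of the line.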
pyPdf is great for some things, but not really good at text extraction. Have a look at the pdfminer library. Or use a tool like pdftotext.