How to get pypdf to read page content line by line?

https://stackoverflow.com/questions/15459802

24-03-2022

Question

I have a pdf in which every page contains an address. The addresses are in this format:

Location Name

Street Address

City, State Zip

for example:

The Gift Store

620 Broadway Street

Van Buren, AR 72956

Every address follows this format exactly, and each one is on a different page of the pdf.

I need to extract the address information and store the results in an Excel/CSV file, with a separate entry for each field: Location Name, Street Address, City, State and Zip must all be in different columns. I am using pyPdf in Python.

I have used the following code to do this, but my code is not considering the newline; instead it gives the whole data of a single page as a continuous string.

import pyPdf  
def getPDFContent(path):
    content = ""
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(9, num_pages):
        x = pdf.getPage(i).extractText()+'\n' 
        content += x

    content = " ".join(content.replace(u"\xa0", " ").strip().split())     
    return content

con = getPDFContent("document.pdf")
print con

For my example above it gives "The Gift Store 620 Broadway Street Van Buren, AR 72956".

If I can read the input line by line, then I can easily get the Location Name and Street Address from the first two lines, and the rest from the third line using substrings.

I tried the solution from "pyPdf ignores newlines in PDF file", but it didn't work for me. I also tried pdfminer: it can extract information line by line, but it converts the pdf to a text file first, and I don't want to do that. I want to do this using pyPdf only. Can anyone suggest where I am going wrong or what I am missing? Is this possible with pyPdf?

Solution

You could try using subprocess to call pdftotext (probably with the -layout option) from the poppler utilities. It has worked much better for me than using pypdf.

For example I've used the following code to extract CAS numbers from a PDF file:

import subprocess
import re

def findCAS(pdf, page=None):
    '''Find all CAS numbers on the numbered page of a file.

    Arguments:
    pdf -- Name of the PDF file to search
    page -- number of the page to search. if None, search all pages.
    '''
    if page is None:
        args = ['pdftotext', '-layout', '-q', pdf, '-']
    else:
        args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                '-q', pdf, '-']
    # check_output returns bytes on Python 3; decode before matching.
    txt = subprocess.check_output(args).decode('utf-8')
    candidates = re.findall(r'\d{2,6}-\d{2}-\d{1}', txt)
    checked = [x.lstrip('0') for x in candidates if checkCAS(x)]
    return list(set(checked))

def checkCAS(cas):
    '''Check if a string is a valid CAS number.

    Arguments:
    cas -- string to check
    '''
    nums = cas[::-1].replace('-', '') # all digits in reverse order
    checksum = int(nums[0]) # first digit is the checksum
    som = 0
    # Checksum method from: http://nl.wikipedia.org/wiki/CAS-nummer
    for n, d in enumerate(nums[1:]):
        som += (n+1)*int(d)
    return som % 10 == checksum
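
Applied to the addresses in the question, the same approach might look like the sketch below. It is not tested against the asker's file and rests on several assumptions: Python 3, pdftotext on the PATH, UTF-8 output, and exactly three non-empty lines per page with the last one shaped like "City, ST 12345". The names page_lines, parse_address, addresses_to_csv and CITY_LINE are made up for this illustration.

```python
import csv
import re
import subprocess

# Assumed shape of the third line: "City, ST 12345" or "City, ST 12345-6789".
CITY_LINE = re.compile(
    r'^(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$')


def page_lines(pdf, page):
    """Return the non-empty text lines of one page (1-based) via pdftotext."""
    args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
            '-q', pdf, '-']
    txt = subprocess.check_output(args).decode('utf-8')
    # strip() also drops the form feed that pdftotext writes after a page.
    return [ln.strip() for ln in txt.splitlines() if ln.strip()]


def parse_address(lines):
    """Split [name, street, 'City, ST Zip'] into five fields, or return None."""
    if len(lines) < 3:
        return None
    m = CITY_LINE.match(lines[2])
    if m is None:
        return None
    return [lines[0], lines[1],
            m.group('city'), m.group('state'), m.group('zip')]


def addresses_to_csv(pdf, first_page, last_page, out_path):
    """Write one CSV row per page in the inclusive range given."""
    with open(out_path, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['Location Name', 'Street Address',
                         'City', 'State', 'Zip'])
        for page in range(first_page, last_page + 1):
            row = parse_address(page_lines(pdf, page))
            if row is not None:  # skip pages that do not fit the pattern
                writer.writerow(row)
```

Calling addresses_to_csv("document.pdf", 1, 10, "addresses.csv") would then write one row per matching page. Pages whose text does not fit the assumed pattern are skipped silently, so it may be worth logging them instead if the real data is less regular.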

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow