Mining PDF data with Python through the clipboard - Python scripting the OS

https://stackoverflow.com/questions/18636740

27-06-2022

Question

I have written a script that extracts data from PDF files. I am using the win32clipboard module to copy the data into Python, and I have the logic working for getting the data I need from each file.

The shortcoming of my process is that I have to open each PDF, press Ctrl-A to select all, then Ctrl-C to copy it to the clipboard. Only then can I run my script. For reference, the script runs within Excel using DataNitro.

I have tried PDFMiner, but it does not seem to be maintained and tends to break the text into small fragments. The PDFs that I am mining contain lots of "small" tables. Copying from the clipboard seems to do a pretty decent job of keeping related things together.

Any suggestions on how I can script opening a PDF, selecting all, and copying? Basically, I am looking for a Python way to script the OS. My gut feeling is that this is not possible, but maybe somebody knows.

Solution

I have settled on using pyPdf. It has a simple method that extracts the text from the PDF. I have written simple functions to find the relevant information I need in this text, splitting the text into a list for easy data identification.
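The answer does not show its lookup functions. As a rough sketch only, one such function might search the list of lines for a label and return whatever follows it on the same line. The name `find_after_label` and the sample labels are hypothetical, and the sketch assumes the value sits on the same line as its label, which may not hold for every PDF.

```python
def find_after_label(lines, label):
    """Return the text following the first occurrence of label, or None.

    Hypothetical helper: assumes the wanted value is on the same line
    as its label.
    """
    for line in lines:
        if label in line:
            # Keep only what comes after the label, without surrounding spaces
            return line.split(label, 1)[1].strip()
    return None


# Made-up sample text standing in for text extracted from a PDF
sample_lines = "Invoice No: 1234\nTotal: 56.70".split("\n")
total = find_after_label(sample_lines, "Total:")
```

Returning None for a missing label lets the caller decide whether an absent field is an error or simply optional.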

I have also written a loop that picks up the relevant files using a glob search and feeds them into the parser.

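import pyPdf

filename = "example.pdf"  # placeholder: path to the PDF to parse

# Open in binary mode and concatenate the text of every page
with open(filename, "rb") as f:
    pdf = pyPdf.PdfFileReader(f)
    data = "".join(page.extractText() for page in pdf.pages)

# Split into lines so individual fields are easier to locate
data2 = data.split("\n")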
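
The glob loop itself is not shown in the answer. A minimal sketch of how it might look follows, assuming all PDFs sit in one folder. The names `extract_all` and `extract_text` are hypothetical; `extract_text` stands for any callable that takes a file path and returns the file's text, for example a function wrapping the pyPdf snippet above.

```python
import glob
import os


def extract_all(folder, extract_text, pattern="*.pdf"):
    """Return {file path: list of text lines} for each matching file.

    extract_text is a caller-supplied function (hypothetical here) that
    takes a path and returns the text of that file as one string.
    """
    results = {}
    # sorted() gives a stable processing order across runs
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        results[path] = extract_text(path).split("\n")
    return results
```

Passing the extractor in as an argument keeps the file discovery separate from the PDF library, so the loop can be tried out with a stub function before pointing it at real files.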

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow