Reading PDF contents using Python [closed]

https://stackoverflow.com/questions/21337002

02-10-2022

Question

I am trying to read the PDF file below, and I need to save each article in a separate file.

https://dl.dropboxusercontent.com/u/23092311/sample.pdf

An article can span one or more pages. I have used PDFMiner to convert the entire PDF to a text file, but I don't know how to split that text into separate articles.

I am new to Python. Please suggest a good method or some sample code for extracting each article separately.

Solution

I'll be honest: I've never used PDFMiner before. But if you already have the PDF as a text file, couldn't you just read the text file into a string and then use the split function to divide the string into articles based on the "The New York Times" heading? That assumes PDFMiner is capable of reading that fancy font, and I don't know whether it is.

Looking at the file you provided, you could do something like the following:

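# Read the PDFMiner text output into a single string
with open('test.txt') as reading:
    full_paper = reading.read()
# Split on the copyright line that separates the articles
split_paper = full_paper.split('Copyright 2014 The New York Times Company. All Rights Reserved.')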

split_paper would then be a list containing your articles at indexes 1 through 6 (index 0 would contain the initial heading). You'd have to do some more string cleanup to get the exact articles, but that should at least get you started.
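To get from that list to one file per article, something along these lines might work. It is only a rough sketch: the helper names split_articles and write_articles, the "articles" output folder and the article_NN.txt file names are all made up for illustration, and it assumes the copyright line really does show up between articles in the PDFMiner output.

```python
from pathlib import Path

# Assumed delimiter: the copyright line used in the split above.
MARKER = "Copyright 2014 The New York Times Company. All Rights Reserved."


def split_articles(text, marker=MARKER):
    """Split text on the marker and return the non-empty, stripped pieces."""
    pieces = (piece.strip() for piece in text.split(marker))
    return [piece for piece in pieces if piece]


def write_articles(articles, out_dir="articles"):
    """Write each article to out_dir/article_01.txt, article_02.txt, ..."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for number, article in enumerate(articles, start=1):
        # Hypothetical naming scheme; pick whatever suits your data.
        target = out_path / f"article_{number:02d}.txt"
        target.write_text(article, encoding="utf-8")
        written.append(target)
    return written
```

With full_paper read as above, write_articles(split_articles(full_paper)[1:]) would skip the first piece (the initial heading) and save the rest. Whether the first piece should be dropped depends on what PDFMiner actually produced, so check it by eye first.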

Make sense?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow