Question

I am trying to read the PDF file below, and I need to save each article in a separate file.

https://dl.dropboxusercontent.com/u/23092311/sample.pdf

An article can span one or more pages. I have used PDFMiner to convert the entire PDF to a txt file, but I don't know how to split it into multiple articles.

I am new to Python. Could you suggest a good method or some sample code to extract each article separately?


Solution

I'll be honest: I've never used PDFMiner. But if you already have the PDF converted to a text file, couldn't you read the text file into a string and then use the split function to divide that string into articles based on the "The New York Times" heading? That assumes PDFMiner is capable of reading that fancy font, which I don't know is possible.

Looking at the file you provided, you could do something like the following:

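# Read the text file produced by PDFMiner; the with block closes it afterwards
with open('test.txt') as reading:
    full_paper = reading.read()
# Split on the copyright line that accompanies each article
split_paper = full_paper.split('Copyright 2014 The New York Times Company. All Rights Reserved.')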

split_paper would then be a list containing your articles at indexes 1 through 6 (index 0 would contain the initial heading). You'd still have to do some other string cleanup to get the exact articles, but that should at least get you started.
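
To go from the split list to one file per article, something along these lines might work. It is only a sketch under assumptions: the helper names split_articles and save_articles, the output folder "articles", and the article_NN.txt naming are all made up for illustration, and it assumes the copyright line really does show up once per article in the PDFMiner output.

```python
from pathlib import Path

# Assumption: this exact line appears once per article in the extracted text.
DELIMITER = "Copyright 2014 The New York Times Company. All Rights Reserved."


def split_articles(text, delimiter=DELIMITER):
    """Split text on the delimiter and drop chunks that are only whitespace."""
    chunks = (chunk.strip() for chunk in text.split(delimiter))
    return [chunk for chunk in chunks if chunk]


def save_articles(articles, out_dir="articles"):
    """Write each article to out_dir/article_NN.txt and return the paths."""
    out_path = Path(out_dir)  # hypothetical output folder
    out_path.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, article in enumerate(articles, start=1):
        target = out_path / f"article_{number:02d}.txt"
        target.write_text(article, encoding="utf-8")
        paths.append(target)
    return paths
```

With the full_paper string from above, calling save_articles(split_articles(full_paper)[1:]) would skip the heading chunk at index 0 and write the rest. Whether index 0 really is just the heading depends on where the copyright line sits in your text, so check the first output file before trusting it.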

Make sense?

License: CC-BY-SA with attribution
Not affiliated with StackOverflow