Getting text content with pdf.js

https://stackoverflow.com/questions/20598292

02-09-2022

Question

I'm trying to get the text from a PDF document using pdf.js. However, pdf.js has no decent documentation. I've looked at the available examples and came up with this:

var pdfUrl = "http://localhost/test.pdf"
var pdf = PDFJS.getDocument(pdfUrl);
pdf.then(function(pdf) {
    var maxPages = pdf.pdfInfo.numPages;
    for (var j = 1; j < maxPages; j++) {
        var page = pdf.getPage(j);

        page.then(function() {
            var textContent = page.getTextContent();

        })
    }
});

The page bit is working, because I can see it is a promise. However, running this gives:

Warning: Unhandled rejection: TypeError: Object #<Object> has no method 'getTextContent'
TypeError: Object #<Object> has no method 'getTextContent'

It works this way in the examples I've seen. It is getting the page, and I can print out the number of pages.

Can anyone with experience shed some light?

*Bonus question: I'm only interested in parsing the PDF, not in rendering it in the browser. However, it has to be done client-side. Is pdf.js the right hammer for the job?

Solution

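page.then(function() { should be page.then(function(page) {

Without the parameter, page inside the callback still refers to the promise returned by pdf.getPage(j), and a promise has no getTextContent method. The resolved page object is passed to the callback as its argument.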
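
Combined with an inclusive page loop, the fix might be sketched as follows. The helper name collectPageText is hypothetical. The sketch assumes the document object exposes numPages and that getTextContent() resolves to an object with an items array whose entries carry a str field, as recent pdf.js releases do; older releases such as the one in the question may return a different shape.

```javascript
// Sketch: collect the text of every page, in page order.
// `pdf` is assumed to look like a loaded pdf.js document:
// `numPages`, and `getPage(n)` returning a promise for a page
// whose `getTextContent()` resolves to { items: [{ str: ... }] }.
function collectPageText(pdf) {
  var pending = [];
  // Pages are numbered from 1, so the upper bound is inclusive.
  for (var j = 1; j <= pdf.numPages; j++) {
    pending.push(
      pdf.getPage(j)
        .then(function (page) {
          // The resolved page arrives as the callback argument.
          return page.getTextContent();
        })
        .then(function (textContent) {
          // Join the text runs of this page into one string.
          return textContent.items.map(function (item) {
            return item.str;
          }).join(" ");
        })
    );
  }
  // Resolves to an array with one string per page.
  return Promise.all(pending);
}
```

With a recent pdf.js build this could be called as pdfjsLib.getDocument(pdfUrl).promise.then(collectPageText). With the older API used in the question, PDFJS.getDocument(pdfUrl).then(collectPageText) would be the equivalent, provided the text content has the shape assumed above.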

Other tips

PDF.js renders your PDF file, extracts its words, and outputs them as HTML elements. Each element is then placed over the rendered PDF with the CSS properties {position: absolute; left: X; top: Y}.

These divs are given the CSS property {color: transparent}. This is what makes selection highlighting work: it looks as if you are selecting text directly in the PDF, but you are actually selecting the generated HTML elements.
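
To make the overlay idea concrete, here is a simplified sketch of how one text run could be turned into such a style. It is not pdf.js's actual text-layer code. The helper name textLayerStyle is hypothetical, and it assumes a pdf.js-style text item whose transform array holds the x and y offsets in its last two entries, a scale of 1, and no rotation.

```javascript
// Hypothetical helper: build the inline style for one text-layer element.
// `item` mimics a pdf.js text item; `pageHeight` is the page height in PDF units.
function textLayerStyle(item, pageHeight) {
  var x = item.transform[4]; // horizontal offset of the text run
  var y = item.transform[5]; // baseline, measured from the bottom of the page
  // Approximate the font height from the vertical scale of the transform.
  var fontHeight = Math.hypot(item.transform[2], item.transform[3]);
  // Flip to CSS coordinates, whose origin is the top-left corner.
  var top = pageHeight - y - fontHeight;
  return "position:absolute;left:" + x + "px;top:" + top + "px;" +
         "font-size:" + fontHeight + "px;color:transparent";
}
```

A real text layer also has to account for the viewport scale, rotation, and font metrics, so the positions computed here are only approximate.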

This is exactly how it works. That is fine if you want to render the PDF file, but keep in mind that if you want a different output technique than transparent HTML divs, you have to bring your own replacement.
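
One possible replacement, if only the text is needed, is to skip the divs and group the text runs into plain-text lines. The sketch below is an assumption-laden illustration, not part of pdf.js: the helper name itemsToLines is hypothetical, and it assumes pdf.js-style items with a str field and a transform array whose last two entries are the x and y position of the run.

```javascript
// Hypothetical helper: turn pdf.js-style text items into plain-text lines.
function itemsToLines(items) {
  var rows = {};
  items.forEach(function (item) {
    // Baseline position, rounded so runs on the same line share a key.
    var y = Math.round(item.transform[5]);
    (rows[y] = rows[y] || []).push(item);
  });
  return Object.keys(rows)
    .map(Number)
    // PDF y grows upward, so a larger y is higher on the page and comes first.
    .sort(function (a, b) { return b - a; })
    .map(function (y) {
      return rows[y]
        // Order the runs of one line from left to right.
        .sort(function (a, b) { return a.transform[4] - b.transform[4]; })
        .map(function (item) { return item.str; })
        .join(" ");
    });
}
```

Rounding the baseline is a crude way to decide which runs share a line; documents with superscripts, columns, or rotated text would need a more careful grouping rule.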

You also need to change the loop to

for (var j = 1; j <= maxPages; j++) {

otherwise you'll never get the last page.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow