Is there any way to access the page header, page footer and page content separately using libpoppler?

StackOverflow https://stackoverflow.com/questions/9360686

  •  28-10-2019

Question

I am using libpoppler to convert a PDF file to plain text, and I want to output the page header, page footer and content separately. How can I do this? Is there any structure or class that holds them?

Thanks in advance!!


Solution

You can get the text of a page with poppler_page_get_text(). Could you parse the plain text afterwards? Here is some sample code. It is C rather than C++ (it uses the poppler-glib binding), but I hope you can see the idea.

Tested on Debian Unstable amd64 with libpoppler-glib-dev 0.18.4-3 and gcc 4.7.1-7:

$ gcc -Wall -g -Wextra get-text.c $(pkg-config --cflags --libs poppler-glib)

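#include <poppler.h>
#include <glib.h>

int main(int argc, char *argv[])
{
    GError *error = NULL;
    PopplerDocument *d;
    PopplerPage *p;
    gchar *f;
    gchar *u;
    gchar *text;

    /* Needed only with GLib older than 2.36; deprecated and a no-op since then. */
    g_type_init();

    if (argc < 2)
            g_error("oops: no file name given");

    /* poppler-glib opens documents by URI, so build an absolute path first. */
    if (g_path_is_absolute(argv[1])) {
            f = g_strdup(argv[1]);
    } else {
            gchar *cwd = g_get_current_dir();
            f = g_build_filename(cwd, argv[1], NULL);
            g_free(cwd);
    }

    u = g_filename_to_uri(f, NULL, &error);
    g_free(f);
    if (!u)
            g_error("oops: %s", error->message);

    d = poppler_document_new_from_file(u, NULL, &error);
    g_free(u);
    if (!d)
            g_error("oops: %s", error->message);

    /* Page indices are zero-based: 0 is the first page. */
    p = poppler_document_get_page(d, 0);
    if (!p)
            g_error("oops: could not load the first page");

    /* The returned string is newly allocated and must be freed. */
    text = poppler_page_get_text(p);
    g_print("%s\n", text);

    g_free(text);
    g_object_unref(p);
    g_object_unref(d);
    return 0;
}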
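
As for parsing the plain text afterwards, here is a rough sketch of one possible approach in plain C, with no poppler dependency. It assumes the header is the first line of the page text and the footer is the last line, which only holds for simple layouts. The names PageParts and split_page_text are made up for this example and are not part of poppler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Slices of the page text. They point into the original string
 * and are NOT NUL-terminated, so always use the lengths. */
typedef struct {
    const char *header; size_t header_len;
    const char *body;   size_t body_len;
    const char *footer; size_t footer_len;
} PageParts;

/* Hypothetical helper: split text such as the string returned by
 * poppler_page_get_text() into first line, middle lines, last line.
 * Assumption: first line = header, last line = footer. */
PageParts split_page_text(const char *text)
{
    PageParts parts = { "", 0, "", 0, "", 0 };
    const char *end, *first_nl, *last_nl;
    size_t len;

    assert(text != NULL);

    /* Ignore trailing newlines so the last line is not empty. */
    len = strlen(text);
    while (len > 0 && text[len - 1] == '\n')
        len--;
    end = text + len;

    first_nl = memchr(text, '\n', len);
    if (first_nl == NULL) {
        /* Zero or one line: nothing to separate, call it body. */
        parts.body = text;
        parts.body_len = len;
        return parts;
    }

    /* Walk back from the end to the last newline. The loop stops
     * at first_nl at the latest, because that is a newline too. */
    last_nl = end - 1;
    while (last_nl > first_nl && *last_nl != '\n')
        last_nl--;

    parts.header = text;
    parts.header_len = (size_t)(first_nl - text);
    parts.footer = last_nl + 1;
    parts.footer_len = (size_t)(end - (last_nl + 1));
    if (last_nl > first_nl) {
        /* Everything between the first and last newline is body. */
        parts.body = first_nl + 1;
        parts.body_len = (size_t)(last_nl - (first_nl + 1));
    }
    return parts;
}
```

You could print a slice with printf("%.*s\n", (int)parts.header_len, parts.header). A more robust version would probably compare the first and last lines across several pages and only treat them as header and footer when they repeat.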

OTHER TIPS

Disclaimer: This might not be a good answer

Last time I checked, libpoppler was just a good renderer that sees a PDF page as a sequence of vector drawing operations. In that sense, it should be possible for it to intercept the text-drawing operations and report the text somehow. But I don't think that text in the header or footer of a page is anything special from the vector point of view. Plus, I have seen a lot of very expensive PDF-to-text converter programs fail miserably at that.

Not really. PDF has no concept of header, footer and body (unless you create a tagged PDF).
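
Since an untagged PDF does not mark these regions, any separation has to be a heuristic based on where the text sits on the page. Below is a minimal sketch of such a heuristic in plain C, assuming you already have a bounding box for each piece of text (poppler-glib can report text rectangles through poppler_page_get_text_layout() and the page size through poppler_page_get_size()). TextBox, Region and classify_box are hypothetical names for this example, and the sketch assumes y is measured downward from the top of the page; flip the comparison if your coordinates start at the bottom.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { REGION_HEADER, REGION_BODY, REGION_FOOTER } Region;

/* Stand-in for a text rectangle in points.
 * Assumption: y1 is the top edge, y2 the bottom edge,
 * both measured downward from the top of the page. */
typedef struct {
    double x1, y1, x2, y2;
} TextBox;

/* Hypothetical heuristic: text lying entirely inside the top
 * margin band is a header, text starting inside the bottom
 * margin band is a footer, everything else is body.
 * margin_fraction is the band height as a share of the page. */
Region classify_box(const TextBox *box, double page_height,
                    double margin_fraction)
{
    double margin;

    assert(box != NULL);
    assert(page_height > 0.0);
    assert(margin_fraction >= 0.0 && margin_fraction < 0.5);

    margin = page_height * margin_fraction;

    if (box->y2 <= margin)
        return REGION_HEADER;
    if (box->y1 >= page_height - margin)
        return REGION_FOOTER;
    return REGION_BODY;
}
```

The margin fraction is a guess that has to be tuned per document; something like 0.1 may work for typical reports, but this will misclassify pages whose body text runs into the margins.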

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow