How to extract text from a PDF using MuPDF?

https://stackoverflow.com/questions/18975881

29-06-2022

Question

I want to extract the text from a PDF and re-lay it out. My code is the following:

BOOL CTextEditorDoc::loadTxt()
{
    if(m_strPDFPath.IsEmpty())
        return FALSE;

#ifdef _DEBUG
    DWORD dwTick = GetTickCount();
    CString strLog;
#endif

    CString strFile;
    fz_context *ctx;
    fz_document* doc;

    fz_matrix ctm;
    fz_page *page;
    fz_device *dev;
    fz_text_page *text;
    fz_text_sheet *sheet;
    int i,line,rotation,pagecount;

    if(!gb2312toutf8(m_strPDFPath,strFile))
        return FALSE;

    ctx = fz_new_context(NULL, NULL, FZ_STORE_UNLIMITED);
    fz_try(ctx){
        doc = fz_open_document(ctx, strFile.GetBuffer(0));
    }fz_catch(ctx){
        fz_free_context(ctx);
        return FALSE;
    }

    line = 0;
    rotation = 0;
    pagecount = 0;
    pagecount = fz_count_pages(doc);

    fz_rotate(&ctm, rotation);
    fz_pre_scale(&ctm,1.0f,1.0f);

    sheet = fz_new_text_sheet(ctx);
    for(i=0;i<pagecount;i++){
        page = fz_load_page(doc,i);
        text = fz_new_text_page(ctx);
        dev = fz_new_text_device(ctx, sheet, text);

#ifdef _DEBUG
        dwTick = GetTickCount();
#endif
        fz_run_page(doc, page, dev, &ctm, NULL);

#ifdef _DEBUG
        strLog.Format("run page:%d ms\n",GetTickCount() - dwTick);
        OutputDebugString(strLog);
        dwTick = GetTickCount();
#endif

        //m_linesInfoVector.push_back(line);
        print_text_page(ctx,m_strContent,text,line);

#ifdef _DEBUG
        strLog.Format("print text:%d ms\n",GetTickCount() - dwTick);
        OutputDebugString(strLog);
        dwTick = GetTickCount();
#endif

        fz_free_device(dev);
        fz_free_text_page(ctx,text);
        fz_free_page(doc, page);
    }

    fz_free_text_sheet(ctx,sheet);
    fz_close_document(doc);
    fz_free_context(ctx);
    return TRUE;
}

This code extracts all the text from the PDF, but it seems too slow. How can I improve it? Most of the time is spent in fz_run_page. If I only want to extract text, can I skip the call to fz_run_page?

Solution

At a quick glance your code looks fine.

To extract text from a PDF you need to interpret the PDF operator streams, and fz_run_page does this. It results in calls to whatever device you specify, in this case the structured text extraction device. That device collates the glyphs, which can be positioned anywhere on the page and in any order, into a more structured form of words, lines, paragraphs, columns and so on.
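To make the interpreter/device split concrete, here is a toy model in plain C++. It is only a sketch of the idea: Op, Device, run_page and TextDevice are made-up names for illustration, not MuPDF's types or functions, and the real structured text device does far more than sort by position.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are NOT MuPDF's API.
// One drawing operation as it might come out of a page's content stream.
struct Op {
    enum Kind { Glyph, Image } kind;
    char ch;      // glyph character (unused for images)
    float x, y;   // position on the page; y grows downward in this model
};

// A "device" receives callbacks while the page is being interpreted.
struct Device {
    virtual ~Device() = default;
    virtual void glyph(char ch, float x, float y) = 0;
    virtual void image(float, float) {}  // default: ignore images
};

// Stand-in for the interpreter: walk every op and dispatch it to the device.
inline void run_page(const std::vector<Op>& ops, Device& dev) {
    for (const Op& op : ops) {
        if (op.kind == Op::Glyph) dev.glyph(op.ch, op.x, op.y);
        else dev.image(op.x, op.y);
    }
}

// Text device: records glyph positions, then collates them into lines.
struct TextDevice : Device {
    struct Placed { char ch; float x, y; };
    std::vector<Placed> glyphs;

    void glyph(char ch, float x, float y) override {
        glyphs.push_back({ch, x, y});
    }

    // Sort top-to-bottom, then left-to-right; break the line when y changes.
    std::string text() const {
        std::vector<Placed> sorted = glyphs;
        std::stable_sort(sorted.begin(), sorted.end(),
                         [](const Placed& a, const Placed& b) {
                             if (a.y != b.y) return a.y < b.y;
                             return a.x < b.x;
                         });
        std::string out;
        for (std::size_t i = 0; i < sorted.size(); ++i) {
            if (i > 0 && sorted[i].y != sorted[i - 1].y) out += '\n';
            out += sorted[i].ch;
        }
        return out;
    }
};
```

Feeding run_page a list of ops in arbitrary order and then calling text() on the device gives the glyphs back in reading order. The point is that the device only sees what the interpreter hands it, so the interpreter has to run either way.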

So, in short you're doing the right thing.

There are currently no user-serviceable ways to speed this up. In future versions we could possibly use a device hint to avoid reading images and similar content. I will ponder this and discuss it with the other devs. But for now you're doing the right thing.

HTH.

OTHER TIPS

No, the fz_run_page call is needed. You need to interpret the pages of the document to pull out the text, and that is what fz_run_page does.

Possibly you could create a simpler text device that avoids keeping track of the character positions, but I doubt that would make a real difference to performance.
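As a rough illustration of what such a simpler device might look like, here is a toy sketch in plain C++. StreamOp, AppendOnlyDevice and interpret are hypothetical names invented for this example and have nothing to do with MuPDF's actual API. The device stores no coordinates, but the interpreter loop still has to visit every operation on the page to find the glyphs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy types, NOT MuPDF's API.
struct StreamOp {
    bool is_glyph;  // false: some non-text operation (image, path, ...)
    char ch;        // glyph character (unused when is_glyph is false)
};

// Simplest possible text device: append characters in content-stream
// order and store no coordinates at all.
struct AppendOnlyDevice {
    std::string text;
    void glyph(char ch) { text += ch; }
};

// Stand-in for the interpreter. It still walks every operation; the
// return value is the number of operations visited, to make that visible.
inline std::size_t interpret(const std::vector<StreamOp>& ops,
                             AppendOnlyDevice& dev) {
    std::size_t visited = 0;
    for (const StreamOp& op : ops) {
        ++visited;
        if (op.is_glyph) dev.glyph(op.ch);
    }
    return visited;
}
```

Note that text collected this way comes out in content-stream order, which is not necessarily reading order, so a device like this would also give up the word/line structure.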

Licensed under: CC-BY-SA with attribution
