Question

I'm looking for a PDF-to-HTML conversion and OCR solution, either as a cloud service or as an SDK. From my searches, I can see that there are a lot of such services on the internet. I tried some of them and got a first impression. I'd like to know whether any of you use such a service.

My biggest concern is automation: I need a pipeline that produces HTML output I can use for information extraction. I'd like structured output, such as tables. Most of the services produce HTML in either a per-character format (a CSS/HTML tag for each character) or a per-paragraph format (CSS/HTML for each line).

What I have checked so far:

  • ABBYY Cloud SDK (They don't have a PDF-to-HTML service, but they do have PDF-to-XML, which can maybe be converted to HTML with XSLT. Their OCR service with text output is also quite good.)
  • cloudconvert.org (They give the same results as the Ubuntu pdftohtml command, which is based on poppler-Xpdf3.0.)
  • pdftohtml command (tested on Ubuntu) - I got a result full of <p> tags.
  • Aspose.PDF (They don't have a PDF-to-HTML service in the cloud, but they have good integration with Google Drive, Dropbox and Amazon S3.)
  • PDFNet from PDFTron: I got a result with a complex CSS and HTML structure, with almost one tag per character.
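
To illustrate the kind of post-processing I have in mind for the per-line output, here is a minimal sketch, not a finished solution. It assumes the converter can emit positioned text fragments, as `pdftohtml -xml` does with `<text top=... left=...>` elements. The sample XML is hand-written to mimic that layout, and the function name `rows_from_positioned_text` and the `tolerance` parameter are my own hypothetical choices.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the layout of `pdftohtml -xml` output
# (assumption: real files carry more attributes and <fontspec> elements).
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="50" width="40" height="12" font="0">Apple</text>
<text top="121" left="200" width="30" height="12" font="0">3</text>
</page>
</pdf2xml>"""


def rows_from_positioned_text(xml_string, tolerance=3):
    """Group text fragments into rows by their vertical position.

    Fragments whose `top` differs from the first fragment of the current
    row by at most `tolerance` units are treated as one row. Cells inside
    a row are ordered left to right.
    """
    root = ET.fromstring(xml_string)
    rows = []
    for page in root.iter("page"):
        fragments = []
        for node in page.iter("text"):
            # itertext() also picks up text nested in inline tags like <b>.
            text = "".join(node.itertext()).strip()
            if text:
                fragments.append((int(node.get("top")), int(node.get("left")), text))
        fragments.sort()  # top to bottom, then left to right

        current, current_top = [], None
        for top, left, text in fragments:
            # A jump in vertical position closes the current row.
            if current_top is not None and abs(top - current_top) > tolerance:
                rows.append([cell for _, cell in sorted(current)])
                current = []
            if not current:
                current_top = top
            current.append((left, text))
        if current:
            rows.append([cell for _, cell in sorted(current)])
    return rows


if __name__ == "__main__":
    for row in rows_from_positioned_text(SAMPLE_XML):
        print(row)
```

This only recovers rows. Detecting column boundaries, merged cells, or text that wraps inside a cell would need more work, which is why I would prefer a service that already produces structured output.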

My question is whether you know of any other service worth trying that produces structured HTML output suitable for data extraction.

Thanks in advance.

No correct solution

License: CC-BY-SA with attribution
Not affiliated with StackOverflow