Question

I'm about to start a project whose objective is to classify PDF documents. I'm wondering if there's a best-practice approach to this problem.

Concretely, I'm wondering which of the following two approaches usually performs better:

  1. Use an OCR reader to convert the files to text and train a classifier on the text data.
  2. Convert the files to images and train a CNN classifier.

I'm planning to classify mostly testimonials and certificates. Since the files within each class share a similar layout and similar text, both ideas should work. I'm wondering if anybody already has experience with this and could tell me about the advantages/disadvantages of each approach.

I'd highly appreciate any kind of help.


Solution

Each method is better suited to different cases.

If you think that dependencies within the text are what discriminate the classes, then take the NLP approach. An image model would need to be very complex to capture that kind of information.

On the other hand, layout and position can be very informative, may not be encoded in the text at all, and in that case only the image approach can capture the information.

Conclusion: consider hand-encoding layout and position features and passing them to the NLP model alongside the text features.
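A minimal sketch of that idea, using only the standard library: bag-of-words counts from the OCR'd text are concatenated with hand-encoded layout features (the toy data, the layout features, and their values are all assumptions for illustration), and a simple nearest-centroid rule stands in for a real classifier.

```python
from collections import Counter
import math

# Hypothetical toy data: each document is (OCR text, layout features, label).
# The layout features here are made up: (normalized title y-position, number of text blocks).
train = [
    ("certificate of completion awarded to", (0.10, 3), "certificate"),
    ("this certificate confirms successful completion", (0.15, 4), "certificate"),
    ("testimonial we highly recommend this employee", (0.30, 8), "testimonial"),
    ("testimonial worked with great dedication and skill", (0.35, 9), "testimonial"),
]

vocab = sorted({w for text, _, _ in train for w in text.split()})

def featurize(text, layout):
    """Concatenate bag-of-words counts with hand-encoded layout features."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab] + list(layout)

def nearest_centroid_predict(x, centroids):
    """Assign the label of the closest class centroid (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Build one centroid per class from the combined text+layout vectors.
centroids = {}
for label in {lbl for _, _, lbl in train}:
    vecs = [featurize(t, lay) for t, lay, lbl in train if lbl == label]
    centroids[label] = [sum(col) / len(col) for col in zip(*vecs)]

print(nearest_centroid_predict(
    featurize("certificate awarded for completion", (0.12, 3)), centroids))
# → certificate
```

In practice you would replace the word counts with something like TF-IDF and the nearest-centroid rule with a proper model, but the key point survives: layout features are just extra columns in the feature vector, so any text classifier can consume them.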

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange