How to use Bayesian analysis to compute and combine weights for multiple rules to identify books

StackOverflow https://stackoverflow.com/questions/7770794

Question

I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.

Some are obvious to the human reader, like:

  • Artificial Intelligence - A Modern Approach 3rd.pdf
  • Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
  • The Complete Guide to PC Repair 5th Ed [2011].pdf
  • Hamlet.txt

Others are not so obvious:

  • Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
  • as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)

Rather than try to code various parsers for different formats of file names, I thought I would build a few dozen simple rules, each with a score.

For example, one rule would look in the first few pages of the file for something resembling an ISBN number, and if found would propose a hypothesis that the file corresponds to the book identified by that ISBN number.
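A rule like that could be sketched as follows. This is not part of the original question; the rule name, the hypothesis shape, and the (deliberately loose) ISBN pattern are illustrative assumptions, and a real implementation would also verify the ISBN check digit.

```python
import re

# Loose ISBN-10/ISBN-13 pattern; an optional "ISBN" label and 978/979 prefix,
# then digit groups separated by hyphens or spaces. Illustrative only.
ISBN_RE = re.compile(
    r"(?:ISBN[-: ]*)?((?:97[89][- ]?)?\d{1,5}[- ]?\d+[- ]?\d+[- ]?[\dXx])"
)

def isbn_rule(text):
    """Scan text extracted from the first few pages for an ISBN-like string.

    Returns a (rule_name, hypothesis) tuple, or None if no plausible ISBN
    is found. Hypothesis keys are assumed for illustration.
    """
    m = ISBN_RE.search(text)
    if m:
        isbn = re.sub(r"[- ]", "", m.group(1))
        if len(isbn) in (10, 13):  # only lengths a valid ISBN can have
            return ("isbn_match", {"isbn": isbn})
    return None
```

A hypothesis proposed by this rule would then be resolved to a title and author via an ISBN lookup service.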

Another rule would look to see if the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules for other formats.
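A minimal sketch of that filename rule, again with assumed rule and hypothesis names:

```python
import os

def author_title_rule(filename):
    """Propose (author, title) when the filename looks like 'Author - Title.ext'."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    parts = stem.split(" - ", 1)  # split on the first ' - ' only
    if len(parts) == 2:
        author, title = (p.strip() for p in parts)
        if author and title:
            return ("author_title_format", {"author": author, "title": title})
    return None
```

Note that this rule fires on 'Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf' even though 'Microsoft Press' is a publisher rather than an author, which is exactly why each rule contributes a weighted hypothesis instead of a final answer.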

I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
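That lookup rule might look like this; the catalogue here is a stand-in for data pulled from Amazon or an ISBN database, and the matching is a naive case-insensitive substring search:

```python
def known_title_rule(filename, known_books):
    """Propose a hypothesis for each catalogue title found in the filename.

    known_books: iterable of (title, author) pairs from an external source.
    Returns a list, since several titles may match one filename.
    """
    name = filename.lower()
    hypotheses = []
    for title, author in known_books:
        if title.lower() in name:
            hypotheses.append(
                ("known_title_match", {"title": title, "author": author})
            )
    return hypotheses
```

This is the rule that would correctly identify 'Hamlet.txt' but fail on obfuscated names like 'as.ar.pdf', where only fuzzier matching (or the ISBN rule) could help.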

In the end I would have a set of tuples like this:

[rulename,hypothesis]

I expect that some rules, such as the ISBN match, will have a high probability of being correct, when they are available. Other rules, like matches based on known book titles and authors, would be more common but not as accurate.

My questions are:

  1. Is this a good approach for solving this problem?
  2. If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into a compound score, to help determine which hypothesis is the strongest or most likely?
  3. Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?

Solution

It depends on the size of your collection and the time you are willing to spend training the classifier. Good generalization that actually saves you time will be hard to achieve: for any type of classifier you will have to build a large training set, and you will also need a lot of rules before accuracy is good. It will probably be more efficient (fewer false positives) to create the rules and use them only to suggest title alternatives for you to choose from, rather than to build a fully automatic classifier. But if the purpose is learning, then go ahead.
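If you do pursue the Bayesian route for ranking the suggestions, one simple approach is a naive-Bayes-style combination: treat each rule as independent evidence and add its log-odds to the score of the hypothesis it supports. The per-rule reliabilities below are made-up placeholders; in practice you would estimate them from a hand-labelled sample of your files.

```python
import math
from collections import defaultdict

# Assumed probability that each rule is correct when it fires.
# These numbers are illustrative, not measured.
RULE_WEIGHTS = {
    "isbn_match": 0.95,
    "author_title_format": 0.7,
    "known_title_match": 0.5,
}

def rank_hypotheses(rule_hypotheses):
    """Combine (rule_name, hypothesis) tuples into a ranked list.

    Each supporting rule adds log(p / (1 - p)) to its hypothesis's score,
    a naive-Bayes simplification that assumes the rules fire independently.
    A weight of 0.5 (no information) contributes nothing.
    """
    scores = defaultdict(float)
    for rule, hypothesis in rule_hypotheses:
        p = RULE_WEIGHTS.get(rule, 0.5)
        key = tuple(sorted(hypothesis.items()))  # make the dict hashable
        scores[key] += math.log(p / (1 - p))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Presenting the top few entries of this ranking as a pick-list, rather than auto-accepting the winner, matches the suggestion above: the rules do the heavy lifting while a human makes the final call.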

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow