Optimizing Keyword Weights for a Web Crawler

Question

So, given a training set of relevant and irrelevant domains, you'd like to build a model that classifies new domains into one of these two categories. I assume the features you will be using are the terms appearing in the domains, i.e. this can be framed as a document classification problem.

Generally, you are correct in assuming that letting statistical machine learning algorithms do the "scoring" for you works better than manually assigning scores to keywords.

A simple way to approach the problem would be to use Bayesian learning; specifically, Naive Bayes might be a good fit.
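To make the Naive Bayes idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written from scratch in Python. The function names, the toy training documents, and the choice to ignore terms never seen in training are all assumptions made for illustration; a framework implementation will differ in the details.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect per-class document and term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "term_counts": term_counts,
        "vocab": vocab,
        "n_docs": len(docs),
    }


def log_scores(model, tokens):
    """Return log P(class) + sum of log P(term | class) for each class."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_class_docs in model["class_counts"].items():
        counts = model["term_counts"][label]
        total_terms = sum(counts.values())
        score = math.log(n_class_docs / model["n_docs"])
        for term in tokens:
            if term not in model["vocab"]:
                continue  # assumption: terms unseen in training are ignored
            # Add-one smoothing keeps unseen (term, class) pairs from zeroing the score.
            score += math.log((counts[term] + 1) / (total_terms + vocab_size))
        scores[label] = score
    return scores


def classify(model, tokens):
    """Pick the class with the highest log score."""
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)


# Hypothetical toy training set: each document is a list of terms plus a label.
training = [
    ("python crawler tutorial scraping".split(), "relevant"),
    ("web crawler keyword weights".split(), "relevant"),
    ("cheap flights hotel deals".split(), "irrelevant"),
    ("celebrity gossip news deals".split(), "irrelevant"),
]
model = train_naive_bayes(training)
prediction = classify(model, "crawler scraping guide".split())
print(prediction)
```

The learned per-class term probabilities play the role of the keyword weights: they are estimated from the tagged data instead of being set by hand.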

After generating a dataset from the domains you've manually tagged (e.g. collecting several pages from each domain and treating each page as a document), you can experiment with various algorithms using one of the machine learning frameworks, e.g. WEKA.
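The dataset generation step could look roughly like the sketch below: every collected page becomes one document that inherits the label of its domain. The domain names, page texts, tokenizer, and record layout are hypothetical; a real pipeline would also fetch the pages and strip HTML, and would then write the result in whatever format the chosen framework expects.

```python
import re
from collections import Counter

# Assumption: a term is any run of lowercase letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def build_dataset(tagged_domains):
    """Turn {domain: (label, [page_text, ...])} into one labeled document per page."""
    dataset = []
    for domain, (label, pages) in tagged_domains.items():
        for page in pages:
            dataset.append({
                "domain": domain,
                "label": label,
                "terms": Counter(tokenize(page)),  # bag-of-words term counts
            })
    return dataset


# Hypothetical manually tagged domains with a few page texts each.
tagged_domains = {
    "crawler-blog.example": (
        "relevant",
        ["Writing a Web Crawler in 10 steps", "Crawler politeness: robots.txt rules"],
    ),
    "travel-deals.example": (
        "irrelevant",
        ["Cheap flights, cheap hotels!"],
    ),
}
dataset = build_dataset(tagged_domains)
for doc in dataset:
    print(doc["domain"], doc["label"], dict(doc["terms"]))
```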

A primer on how to handle and load text documents into WEKA can be found here. After the data is loaded, you can use the framework to experiment with various classification algorithms, e.g. Naive Bayes, SVM, etc. Once you've found the method that best fits your needs, you can export the resulting model and use it via WEKA's Java API.
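Experimenting with several algorithms usually means scoring each one on held-out data and comparing the results. The sketch below shows that idea with a small k-fold harness in plain Python. It is not WEKA's API, and the two candidate "classifiers" (a majority-class baseline and a simple term-vote rule) are hypothetical stand-ins for the real algorithms you would compare, as is the toy data.

```python
from collections import Counter


def k_fold_accuracy(train_fn, docs, k=3):
    """Fraction of documents classified correctly when each is held out once.

    train_fn(docs) must return a predict(tokens) function.
    """
    correct = 0
    for fold in range(k):
        test = docs[fold::k]  # every k-th document, starting at `fold`
        train = [d for i, d in enumerate(docs) if i % k != fold]
        predict = train_fn(train)
        correct += sum(predict(tokens) == label for tokens, label in test)
    return correct / len(docs)


def train_majority(docs):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(label for _, label in docs).most_common(1)[0][0]
    return lambda tokens: most_common


def train_term_vote(docs):
    """Stand-in classifier: each known term votes for the labels it appeared with."""
    seen = {}
    for tokens, label in docs:
        for term in tokens:
            seen.setdefault(term, Counter())[label] += 1
    fallback = train_majority(docs)

    def predict(tokens):
        votes = Counter()
        for term in tokens:
            votes.update(seen.get(term, {}))
        return votes.most_common(1)[0][0] if votes else fallback(tokens)

    return predict


# Hypothetical toy documents: (terms, label) pairs.
docs = [
    ("web crawler queue".split(), "relevant"),
    ("hotel deals flights".split(), "irrelevant"),
    ("crawler keyword weights".split(), "relevant"),
    ("cheap flights deals".split(), "irrelevant"),
    ("keyword crawler web".split(), "relevant"),
    ("hotel cheap deals".split(), "irrelevant"),
]
candidates = {"majority": train_majority, "term_vote": train_term_vote}
results = {name: k_fold_accuracy(fn, docs) for name, fn in candidates.items()}
print(results)
```

Comparing every candidate against a trivial baseline like this makes it easier to tell whether a more complex method is actually earning its keep on your data.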