Question

I'm playing around with writing a web crawler that scans for a specific set of keywords and then assigns a global score to each domain it encounters based on a cumulative score I assigned to each keyword (programming=1, clojure=2, javascript=-1, etc...).

I have set up my keyword scoring on a sliding scale of -10 to 10 and I have based my initial values on my own assumptions about what is and is not relevant.

I feel that my scoring model may be flawed, and I would prefer to feed a list of domains that match the criteria I'm trying to capture into an analysis tool and optimize my keyword weights based on some kind of statistical analysis.

What would be an appropriate analysis technique to generate an optimal scoring model for a list of "known good domains"? Is this problem suited for bayesian learning, monte carlo simulation, or some other technique?

Was it helpful?

Solution

So, given a training set of relevant and irrelevant domains, you'd like to build a model which classifies new domains to one of these categories. I assume the features you will be using are the terms appearing in the domains, i.e. this is can be framed as a document classification problem.

Generally, you are correct in assuming that letting statistical-based machine learning algorithms to do the "scoring" for you works better than assigning manual scores to keywords.

A simple way to approach the problem would be to using Bayesian learning, and specifically, Naive Bayes might be a good fit.

After generating a dataset from the domains you've manually tagged (e.g. collecting several pages from each domain and treating each as a document), you can experiment various algorithms using one of the machine learning frameworks, e.g. WEKA.

A primer on how to handle and load text documents to WEKA can be found here. After the data is loaded, you can use the framework to experiment with various classification algorithms, e.g. Naive Bayes, SVM, etc. Once you've found the method best fitting your needs, you can export the resulting model and use it via WEKA's Java API.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top