Web Crawling: Assigning a score to a URL (using the words that compose it) given statistics of words previously crawled

StackOverflow https://stackoverflow.com/questions/22357818

Question

I'm having a hard time developing an algorithm/formula to determine the score of a link given the words that compose it. This also applies to the text context (the sentences) that wraps around the URL. For simplicity's sake, the host of the URL is not taken into account.

When processing a web document, a score is computed for that page and passed to the outlinks found in the page. There are input words/terms (called search tags from now on; a tag may be composed of multiple words) that determine whether a given document is relevant (i.e. has a positive page score). Each tag has a weight that determines how much it adds to the page's score. So in general, a page's score is a function of the input search tags found in the document, the frequency of those tags in the document, and the weight of each tag.
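
To make the idea concrete, here is a minimal sketch of that kind of page scoring in Python, assuming a hand-picked tag-to-weight mapping (the tags, weights, and names below are illustrations, not part of an actual system):

    import re

    # Hypothetical search tags and their weights; in practice these are the
    # inputs chosen by whoever configures the crawl.
    SEARCH_TAGS = {"panda bear": 2.0, "wildlife": 1.0, "conservation": 1.5}

    def page_score(text):
        """Weighted sum of search-tag frequencies in the page text."""
        text = text.lower()
        score = 0.0
        for tag, weight in SEARCH_TAGS.items():
            frequency = len(re.findall(re.escape(tag), text))
            score += weight * frequency
        return score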

If a page has a positive score (relevant), all the terms/words in the page have their statistics updated, i.e. the page's score is accumulated into each word's sum of scores. The statistics for search tags themselves get a "boost": the accumulated score is multiplied by some constant.

So, given a set of terms/words that have been previously crawled (terms, i.e. multi-word entries, are only the search tags), each of these words has the following statistics (a rough sketch of this bookkeeping follows the list):

  1. Accumulated Score from crawled relevant pages
  2. Number of times this word has been found in relevant pages
  3. Number of times this word has been found in irrelevant pages

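For illustration, the bookkeeping I have in mind looks roughly like this (building on the SEARCH_TAGS mapping in the sketch above, with a made-up boost constant):

    from collections import defaultdict

    BOOST = 2.0  # made-up multiplier applied to search tags' accumulated score

    # Per-term statistics: (1) accumulated score, (2) relevant-page count,
    # (3) irrelevant-page count.
    stats = defaultdict(lambda: {"score": 0.0, "relevant": 0, "irrelevant": 0})

    def update_stats(page_terms, score):
        """Fold one crawled page's score into the statistics of its terms."""
        for term in set(page_terms):
            entry = stats[term]
            if score > 0:  # relevant page
                boost = BOOST if term in SEARCH_TAGS else 1.0
                entry["score"] += boost * score
                entry["relevant"] += 1
            else:          # irrelevant page
                entry["irrelevant"] += 1
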
Given the statistics of all words found during the crawl, how should I use them to compute the score (or probability of being relevant) of a link or of its cohesive text context? Or are there any additional word statistics I should track in order to use statistical methods such as Bayesian classification? Any brilliant ideas? Thanks so much!

Edit: Note that the statistics are shared across synonymous words, regardless of which part of speech they belong to. I will be using WordNet to implement this.
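
For example, with NLTK's WordNet interface the sharing could be approximated by mapping each word to a representative lemma of its first synset (a crude simplification that skips proper word-sense disambiguation):

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def canonical_term(word):
        """Map a word to a shared key so that synonyms (and different parts of
        speech of the same concept) update a single statistics entry."""
        synsets = wn.synsets(word)
        if not synsets:
            return word.lower()
        # Use the first lemma of the first synset as the representative key.
        return synsets[0].lemmas()[0].name().lower()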

Solution

I wrote a Web crawler that did something very similar to what I think you're talking about. There are two pieces to the puzzle. They're related, but you have to keep them separate or thinking about the thing gets confusing.

The first thing is what we called the "nugget function." That is the function that determines if the page is or is not relevant. In our case it was pretty simple because we were looking for media files (music and video files of all types). If the file we downloaded was a media file then the nugget function returned 1.0. If the file was not a known (by us) media file type, the nugget function returned 0.0.
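
In code, a nugget function like that is essentially a file-type test; a sketch of the idea (the extension list and content-type check here are illustrative, not the crawler's actual logic):

    MEDIA_EXTENSIONS = {".mp3", ".ogg", ".flac", ".avi", ".mp4", ".mkv", ".wmv"}

    def nugget(url, content_type):
        """Return 1.0 if the downloaded file looks like a media file, else 0.0."""
        if content_type.startswith(("audio/", "video/")):
            return 1.0
        return 1.0 if any(url.lower().endswith(ext) for ext in MEDIA_EXTENSIONS) else 0.0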

The other piece is a simple Bayesian classifier. Whenever we found a nugget, we'd add 1 to the good result count for each term in the URL. If the nugget function returned 0, then we'd add 1 to the bad result count for each term in the URL.

With those term counts, we could run any URL through the classifier and it would give us a probability from 0.0 to 1.0 of whether the URL would lead us to a media file.
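
A minimal sketch of that counting-plus-classification scheme, written as a naive Bayes classifier over URL terms with Laplace smoothing (the tokenizer and smoothing constants are illustrative choices, not the original implementation):

    import math
    import re
    from collections import defaultdict

    good = defaultdict(float)  # term -> count of appearances in nugget URLs
    bad = defaultdict(float)   # term -> count of appearances in non-nugget URLs

    def url_terms(url):
        """Split a URL into lowercase word-like terms."""
        return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

    def train(url, is_nugget):
        counts = good if is_nugget else bad
        for term in url_terms(url):
            counts[term] += 1

    def classify(url):
        """Probability (0.0..1.0) that the URL leads to a nugget."""
        log_p, log_not_p = 0.0, 0.0
        for term in url_terms(url):
            g, b = good.get(term, 0.0), bad.get(term, 0.0)
            p = (g + 1.0) / (g + b + 2.0)  # Laplace-smoothed per-term probability
            log_p += math.log(p)           # log space avoids underflow on long URLs
            log_not_p += math.log(1.0 - p)
        # Combine under the naive independence assumption with equal priors.
        return 1.0 / (1.0 + math.exp(log_not_p - log_p))

Scoring every frontier URL with classify() and visiting the highest-probability URLs first gives you that prioritization.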

After the first cut, which worked surprisingly well, we added other things to the terms list, most importantly the anchor text associated with the URL, and the text of the HTML page that provided the link. Those improved the classifier significantly, but the primary gain was just from the terms in the URL itself. For us, the host name was a very important piece. It turns out that hosts that have provided good content in the past typically continue to provide good content.

For your purposes, you probably want your nugget function to return a relevance score between 0.0 and 1.0, and use that to add to the good or bad counts in your terms.
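
That graded variant could be as simple as splitting each term's update between the two counters, reusing the counters and tokenizer from the sketch above (one possible reading of the suggestion, not a prescribed formula):

    def train_graded(url, relevance):
        """Fractional training with a nugget score in [0.0, 1.0]: a page scored
        0.8 adds 0.8 to each URL term's good count and 0.2 to its bad count."""
        for term in url_terms(url):
            good[term] += relevance
            bad[term] += 1.0 - relevance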

You'll also have to prune your terms list on a regular basis because approximately 60% of the terms you'll see will be hapaxes; they'll occur only once. You'll have to prune uncommon terms from your terms list to make room for new terms. There's a lot of churn in that list when you're crawling the Web.
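
Pruning can be as simple as periodically dropping terms whose combined count is still tiny; a sketch building on the counters above (the threshold and schedule are up to you):

    def prune(min_total=2.0):
        """Drop rarely seen terms (mostly hapaxes) to make room for new ones."""
        rare = [t for t in set(good) | set(bad)
                if good.get(t, 0.0) + bad.get(t, 0.0) < min_total]
        for term in rare:
            good.pop(term, None)
            bad.pop(term, None)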

We did some testing with and without our classifier. Without the classifier, approximately 1 in 500 links that we visited was a media file. With the first cut of the simple classifier that looked only at URL terms, one document in every 70 was a media file. Adding the other terms (anchor text and page text) brought that down to about one in 50. Note that this was without crawling YouTube. We used a separate technique to get new videos from YouTube.

My original cut was a naive Bayes classifier that I put together after studying the description of Bayesian classification in the book Ending Spam. That's right, I used a spam filter to determine which URLs to crawl.

I wrote a little bit about the crawler a few years ago on my blog. I had planned a long series, but got sidetracked and never finished it. You might find some useful information there, although the treatment is from a fairly high level. See Writing a Web Crawler for an introduction and links to the other parts.

OTHER TIPS

The link text is more important than the text of the document that contains the link: e.g. a page about panda bears may have one outgoing link that says "More about panda bears" and another that says "About other animals". So you should probably score the link text separately from the entire document text, and then combine the scores to get a score for the link. As for statistical modelling methods you could employ, you could look into topic modelling, but I think it's overkill for your problem. At any rate, such methods usually just use word-count vectors as input to compute the probability of a given topic.
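
One way to combine the two, assuming you already have a term-based scorer such as the classify() sketch in the accepted answer (the 0.7/0.3 split is an arbitrary illustration, not a recommended value):

    def link_score(anchor_text, page_text, anchor_weight=0.7):
        """Weight the anchor-text score more heavily than the page-text score."""
        return (anchor_weight * classify(anchor_text)
                + (1.0 - anchor_weight) * classify(page_text))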

Licensed under: CC-BY-SA with attribution