Web Crawling: Assigning a score to a URL (using the words that compose it) given statistics of words previously crawled

StackOverflow https://stackoverflow.com/questions/22357818

Question

I'm having a hard time developing an algorithm/formula to determine the score of a link given the words that compose it. This also applies to the text context (the sentences) that wraps around the URL. For simplicity's sake, the host of the URL is not taken into account.

When processing a web document, a score is computed for that page and passed to the outlinks found in the page. There are input words/terms (called search tags from now on; a tag may be composed of multiple words) that determine whether a given document is relevant (i.e. has a positive page score). Each tag has a weight that determines how much it adds to the page's score. So in general, a page's score is a function of the input search tags found in the document, the frequency of those tags in the document, and the weight of each tag.
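
To make the idea concrete, here is a minimal sketch of that kind of page scoring in Python, assuming a hand-picked tag-to-weight mapping (the tags, weights, and names below are illustrations, not part of an actual system):

    import re

    # Hypothetical search tags and their weights; in practice these are the
    # inputs chosen by whoever configures the crawl.
    SEARCH_TAGS = {"panda bear": 2.0, "wildlife": 1.0, "conservation": 1.5}

    def page_score(text):
        """Weighted sum of search-tag frequencies in the page text."""
        text = text.lower()
        score = 0.0
        for tag, weight in SEARCH_TAGS.items():
            frequency = len(re.findall(re.escape(tag), text))
            score += weight * frequency
        return score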

If a page has a positive score (relevant), all the terms/words in the page have their statistics updated, i.e. the page's score is accumulated into each word's sum of scores. The statistics for search tags themselves get a "boost": the accumulated score is multiplied by some constant.

So, given a set of terms/words that have been previously crawled (terms, i.e. multi-word entries, are only the search tags), each of these words has the following statistics (a rough sketch of this bookkeeping follows the list):

  1. Accumulated Score from crawled relevant pages
  2. Number of times this word has been found in relevant pages
  3. Number of times this word has been found in irrelevant pages

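For illustration, the bookkeeping I have in mind looks roughly like this (building on the SEARCH_TAGS mapping in the sketch above, with a made-up boost constant):

    from collections import defaultdict

    BOOST = 2.0  # made-up multiplier applied to search tags' accumulated score

    # Per-term statistics: (1) accumulated score, (2) relevant-page count,
    # (3) irrelevant-page count.
    stats = defaultdict(lambda: {"score": 0.0, "relevant": 0, "irrelevant": 0})

    def update_stats(page_terms, score):
        """Fold one crawled page's score into the statistics of its terms."""
        for term in set(page_terms):
            entry = stats[term]
            if score > 0:  # relevant page
                boost = BOOST if term in SEARCH_TAGS else 1.0
                entry["score"] += boost * score
                entry["relevant"] += 1
            else:          # irrelevant page
                entry["irrelevant"] += 1
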
Given the statistics of all words found during the crawl, how should I use them to compute the score (or probability of being relevant) of a link or of its cohesive text context? Or are there any additional word statistics I should track in order to use statistical methods such as Bayesian classification? Any brilliant ideas? Thanks so much!

Edit: Note that the statistics are shared across synonymous words, regardless of which part of speech they belong to. I will be using WordNet to implement this.
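
For example, with NLTK's WordNet interface the sharing could be approximated by mapping each word to a representative lemma of its first synset (a crude simplification that skips proper word-sense disambiguation):

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def canonical_term(word):
        """Map a word to a shared key so that synonyms (and different parts of
        speech of the same concept) update a single statistics entry."""
        synsets = wn.synsets(word)
        if not synsets:
            return word.lower()
        # Use the first lemma of the first synset as the representative key.
        return synsets[0].lemmas()[0].name().lower()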

Solution

I wrote a Web crawler that did something very similar to what I think you're talking about. There are two pieces to the puzzle. They're related, but you have to keep them separate or thinking about the thing gets confusing.

The first thing is what we called the "nugget function." That is the function that determines if the page is or is not relevant. In our case it was pretty simple because we were looking for media files (music and video files of all types). If the file we downloaded was a media file then the nugget function returned 1.0. If the file was not a known (by us) media file type, the nugget function returned 0.0.
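
In code, a nugget function like that is essentially a file-type test; a sketch of the idea (the extension list and content-type check here are illustrative, not the crawler's actual logic):

    MEDIA_EXTENSIONS = {".mp3", ".ogg", ".flac", ".avi", ".mp4", ".mkv", ".wmv"}

    def nugget(url, content_type):
        """Return 1.0 if the downloaded file looks like a media file, else 0.0."""
        if content_type.startswith(("audio/", "video/")):
            return 1.0
        return 1.0 if any(url.lower().endswith(ext) for ext in MEDIA_EXTENSIONS) else 0.0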

The other piece is a simple Bayesian classifier. Whenever we found a nugget, we'd add 1 to the good result count for each term in the URL. If the nugget function returned 0, then we'd add 1 to the bad result count for each term in the URL.

With those term counts, we could run any URL through the classifier and it would give us a probability from 0.0 to 1.0 of whether the URL would lead us to a media file.
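
A minimal sketch of that counting-plus-classification scheme, written as a naive Bayes classifier over URL terms with Laplace smoothing (the tokenizer and smoothing constants are illustrative choices, not the original implementation):

    import math
    import re
    from collections import defaultdict

    good = defaultdict(float)  # term -> count of appearances in nugget URLs
    bad = defaultdict(float)   # term -> count of appearances in non-nugget URLs

    def url_terms(url):
        """Split a URL into lowercase word-like terms."""
        return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

    def train(url, is_nugget):
        counts = good if is_nugget else bad
        for term in url_terms(url):
            counts[term] += 1

    def classify(url):
        """Probability (0.0..1.0) that the URL leads to a nugget."""
        log_p, log_not_p = 0.0, 0.0
        for term in url_terms(url):
            g, b = good.get(term, 0.0), bad.get(term, 0.0)
            p = (g + 1.0) / (g + b + 2.0)  # Laplace-smoothed per-term probability
            log_p += math.log(p)           # log space avoids underflow on long URLs
            log_not_p += math.log(1.0 - p)
        # Combine under the naive independence assumption with equal priors.
        return 1.0 / (1.0 + math.exp(log_not_p - log_p))

Scoring every frontier URL with classify() and visiting the highest-probability URLs first gives you that prioritization.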

After the first cut, which worked surprisingly well, we added other things to the terms list, most importantly the anchor text associated with the URL, and the text of the HTML page that provided the link. Those improved the classifier significantly, but the primary gain was just from the terms in the URL itself. For us, the host name was a very important piece. It turns out that hosts that have provided good content in the past typically continue to provide good content.

For your purposes, you probably want your nugget function to return a relevance score between 0.0 and 1.0, and use that to add to the good or bad counts in your terms.
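
That graded variant could be as simple as splitting each term's update between the two counters, reusing the counters and tokenizer from the sketch above (one possible reading of the suggestion, not a prescribed formula):

    def train_graded(url, relevance):
        """Fractional training with a nugget score in [0.0, 1.0]: a page scored
        0.8 adds 0.8 to each URL term's good count and 0.2 to its bad count."""
        for term in url_terms(url):
            good[term] += relevance
            bad[term] += 1.0 - relevance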

You'll also have to prune your terms list on a regular basis because approximately 60% of the terms you'll see will be hapaxes; they'll occur only once. You'll have to prune uncommon terms from your terms list to make room for new terms. There's a lot of churn in that list when you're crawling the Web.
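
Pruning can be as simple as periodically dropping terms whose combined count is still tiny; a sketch building on the counters above (the threshold and schedule are up to you):

    def prune(min_total=2.0):
        """Drop rarely seen terms (mostly hapaxes) to make room for new ones."""
        rare = [t for t in set(good) | set(bad)
                if good.get(t, 0.0) + bad.get(t, 0.0) < min_total]
        for term in rare:
            good.pop(term, None)
            bad.pop(term, None)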

We did some testing with and without our classifier. Without the classifier, approximately 1 in 500 links that we visited was a media file. With the first cut of the simple classifier that looked only at URL terms, one document in every 70 was a media file. Adding the other terms (anchor text and page text) brought that down to about one in 50. Note that this was without crawling YouTube. We used a separate technique to get new videos from YouTube.

My original cut was a naive Bayes classifier that I put together after studying the description of Bayesian classification in the book Ending Spam. That's right, I used a spam filter to determine which URLs to crawl.

I wrote a little bit about the crawler a few years ago on my blog. I had planned a long series, but got sidetracked and never finished it. You might find some useful information there, although the treatment is from a fairly high level. See Writing a Web Crawler for an introduction and links to the other parts.

OTHER TIPS

The link text is more important than the text of the document that contains the link: e.g. a page about panda bears may have one outgoing link that says "More about panda bears" and another that says "About other animals". So you should probably score the link text separately from the entire document text, and then combine the scores to get a score for the link. As for statistical modelling methods you could employ, you could look into topic modelling, but I think it's overkill for your problem. At any rate, such methods usually just use word-count vectors as input to compute the probability of a given topic.
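
One way to combine the two, assuming you already have a term-based scorer such as the classify() sketch in the accepted answer (the 0.7/0.3 split is an arbitrary illustration, not a recommended value):

    def link_score(anchor_text, page_text, anchor_weight=0.7):
        """Weight the anchor-text score more heavily than the page-text score."""
        return (anchor_weight * classify(anchor_text)
                + (1.0 - anchor_weight) * classify(page_text))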

Licensed under: CC-BY-SA with attribution