Question

I'm trying to detect whether certain websites are "valid" websites. Some things that make a website invalid:

  • Gives back bad status codes
  • Page content is empty
  • Website is a squatter (for example, the URL points to a GoDaddy page, or any page that says "come register this domain!")

I'm trying to figure out how to detect whether a website is a squatter. I'm using Java if that matters. Any ideas?


Solution

Sounds like a good task for machine learning, in my opinion.

Collect a sample of websites, some of which are 'squatters' and some of which are not (this is called the training set).

Use the bag-of-words model or the tf-idf model (or any other model) as your feature space, and train a classifier using a supervised learning algorithm (SVM, decision trees, ...).

At run time, use your classifier to determine whether a website is a squatter.
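As a rough sketch of that runtime step, the page could be fetched with plain HttpURLConnection, rejected straight away on a bad status code or an empty body (your first two criteria), and only then handed to the classifier. SquatterClassifier.looksLikeSquatter is an assumed helper, sketched after the Weka note below:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class SiteCheck {

    // Applies the "invalid website" rules from the question: a bad status code
    // or an empty body fails immediately; otherwise the page text goes to the
    // squatter classifier (an assumed helper, sketched further down).
    static boolean isValidWebsite(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);

        int status = conn.getResponseCode();
        if (status < 200 || status >= 300) {
            return false;                                   // bad status code
        }

        String body;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            body = in.lines().collect(Collectors.joining("\n"));
        }
        if (body.trim().isEmpty()) {
            return false;                                   // empty page content
        }

        return !SquatterClassifier.looksLikeSquatter(body); // squatter check
    }
}
```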

Weka is a Java library that implements many machine learning algorithms and might help you.
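To make that concrete, here is a minimal sketch of such a pipeline with Weka, assuming the page text is already available as a String: StringToWordVector builds the bag-of-words / tf-idf feature space, FilteredClassifier wires it to an SVM (SMO) so the same transformation is applied at training and at prediction time, and the SquatterClassifier class, its 'legit'/'squatter' labels, and the toy training texts are all placeholders for your own data.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.Classifier;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SquatterClassifier {

    private static Instances header;   // dataset structure: one text attribute + the class
    private static Classifier model;   // trained FilteredClassifier

    public static void train() throws Exception {
        // One string attribute for the page text, one nominal class attribute.
        Attribute text = new Attribute("text", (ArrayList<String>) null);
        Attribute cls = new Attribute("class", new ArrayList<>(Arrays.asList("legit", "squatter")));
        Instances train = new Instances("websites", new ArrayList<>(Arrays.asList(text, cls)), 0);
        train.setClassIndex(1);

        // Toy training set; real labelled page texts would go here.
        add(train, "Welcome to our online store, browse our products", "legit");
        add(train, "Latest news, articles and tutorials about programming", "legit");
        add(train, "This domain is for sale, buy this domain today", "squatter");
        add(train, "This domain is parked free, courtesy of the registrar", "squatter");

        // Bag-of-words feature space with tf-idf weighting; FilteredClassifier
        // applies the same transformation at training and at prediction time.
        StringToWordVector bow = new StringToWordVector();
        bow.setLowerCaseTokens(true);
        bow.setTFTransform(true);
        bow.setIDFTransform(true);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(bow);
        fc.setClassifier(new SMO());   // SVM; a decision tree such as J48 would also work
        fc.buildClassifier(train);

        header = train.stringFreeStructure();
        model = fc;
    }

    // True if the trained model predicts "squatter" for the given page text.
    public static boolean looksLikeSquatter(String pageText) throws Exception {
        Instances test = new Instances(header, 1);
        test.setClassIndex(1);
        add(test, pageText, "legit");  // dummy label, ignored at prediction time
        double predicted = model.classifyInstance(test.firstInstance());
        return "squatter".equals(test.classAttribute().value((int) predicted));
    }

    // Appends one (text, label) pair to a dataset.
    private static void add(Instances data, String pageText, String label) {
        Instance inst = new DenseInstance(2);
        inst.setDataset(data);
        Attribute textAtt = data.attribute("text");
        inst.setValue(textAtt, textAtt.addStringValue(pageText));
        inst.setClassValue(label);
        data.add(inst);
    }
}
```

Training once at startup (SquatterClassifier.train()) and then calling looksLikeSquatter(pageText) per site is enough for a first pass; Weka's Evaluation class can cross-validate the model once a realistic training set is in place.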

Licensed under: CC-BY-SA with attribution