How to detect whether a website is a squatter?
20-12-2019
Question
I'm trying to detect whether certain websites are "valid" websites. Some things that make a website invalid:
- Gives back bad status codes
- Page content is empty
- Website is a squatter (for example, the URL points to a GoDaddy page, or to any page that says "come register this domain!")
I'm trying to figure out how to detect whether a website is a squatter. I'm using Java if that matters. Any ideas?
Solution
This sounds like a good task for machine learning, in my opinion.
Collect a sample of websites, some of which are squatters and some of which are not (this is called the training set).
Use the bag-of-words model or the tf-idf model (or any other model) as your feature space, and train a classifier using some supervised learning algorithm (SVM, decision trees, ...).
At run time, use your classifier to determine whether a website is a squatter or not.
Weka is a Java library that implements many machine learning algorithms, and it might help you.
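To make the steps above concrete, here is a minimal sketch in plain Java, with no Weka dependency. It uses bag-of-words counts as features and a multinomial naive Bayes classifier. Naive Bayes is my own pick for brevity, not one of the algorithms named above, and the class name `SquatterClassifier`, its methods, and the training texts are all made up for illustration. A real training set would need many more pages, and you would still have to fetch each page and strip its HTML before classifying it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical example: multinomial naive Bayes over bag-of-words counts. */
class SquatterClassifier {
    // Per-word counts for each class: index 0 = normal site, index 1 = squatter.
    private final Map<String, int[]> wordCounts = new HashMap<>();
    private final int[] totalWords = new int[2];
    private final int[] docCounts = new int[2];

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Adds one labeled page to the training set. */
    void train(String pageText, boolean isSquatter) {
        int label = isSquatter ? 1 : 0;
        docCounts[label]++;
        for (String word : tokenize(pageText)) {
            wordCounts.computeIfAbsent(word, w -> new int[2])[label]++;
            totalWords[label]++;
        }
    }

    /** Log of P(class) times the product of P(word | class), with add-one smoothing. */
    private double logScore(String[] words, int label) {
        int totalDocs = docCounts[0] + docCounts[1];
        int vocabularySize = wordCounts.size();
        double score = Math.log((docCounts[label] + 1.0) / (totalDocs + 2.0));
        for (String word : words) {
            int[] counts = wordCounts.get(word);
            if (counts == null) {
                continue; // Words never seen in training carry no evidence; skip them.
            }
            score += Math.log((counts[label] + 1.0) / (totalWords[label] + vocabularySize));
        }
        return score;
    }

    /** Returns true if the squatter class scores higher than the normal class. */
    boolean isSquatter(String pageText) {
        String[] words = tokenize(pageText);
        return logScore(words, 1) > logScore(words, 0);
    }

    public static void main(String[] args) {
        SquatterClassifier classifier = new SquatterClassifier();
        // Tiny made-up training set; real data would come from crawled pages.
        classifier.train("This domain is for sale. Register this domain today at GoDaddy", true);
        classifier.train("Buy this domain now, domain parking by registrar", true);
        classifier.train("Welcome to our bakery. Fresh bread and cakes baked daily", false);
        classifier.train("Read our latest blog posts about Java programming", false);

        System.out.println(classifier.isSquatter("this domain may be for sale"));
        System.out.println(classifier.isSquatter("fresh bread and a blog about java"));
    }
}
```

If you go with Weka instead, its built-in filters and classifiers would replace the hand-rolled counting and scoring here; the overall flow (label pages, extract word features, train, classify) stays the same.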