Question

Are there any reliable/deployed approaches, algorithms, or tools for tagging a website's type by parsing some of its webpages?

For example: forums, blogs, press-release sites, news, e-commerce, etc.

I am looking for some well-defined characteristics (static rules) from which this can be determined. If not, then I hope a machine learning model may help.

Suggestions/ideas?


Solution

If you approach this from a machine learning standpoint, a Naive Bayes classifier probably has the best payoff-to-effort ratio. A version of it is used in Winnow to categorize news articles.

You will need a collection of pages, each tagged with its proper category. Then you extract words or other relevant elements from each page and use them as features.
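As a concrete illustration, here is a minimal sketch of that pipeline in Python using scikit-learn. The page snippets and category labels are made-up placeholders, not real training data:

```python
# Minimal sketch: classify site type from extracted page text
# with a multinomial Naive Bayes classifier.
# Assumes scikit-learn is installed; pages/labels are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data: text extracted from pages, each tagged with its category.
pages = [
    "add to cart checkout shipping price product reviews",
    "reply thread posted by member join date quote",
    "posted in archive comments leave a reply tagged",
    "press release for immediate release media contact",
]
labels = ["e-commerce", "forum", "blog", "press-release"]

# Bag-of-words features: each distinct word becomes a feature.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

clf = MultinomialNB()
clf.fit(X, labels)

# Classify the extracted text of a new page.
new_page = ["limited time offer free shipping add to cart"]
print(clf.predict(vectorizer.transform(new_page)))  # -> ['e-commerce']
```

In practice you would train on the full extracted text of many pages per category; TF-IDF weighting or extra features (URL tokens, HTML structure) often help.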

Dr. Dobb's has an article on implementing Naive Bayes.
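To give a sense of what such an implementation involves (this is not the article's code, just a rough sketch of multinomial Naive Bayes with add-one smoothing):

```python
# Rough from-scratch sketch of multinomial Naive Bayes
# with Laplace (add-one) smoothing.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (word_list, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(words, word_counts, label_counts, vocab):
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        # log P(label) + sum of log P(word | label), smoothed
        score = math.log(doc_count / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```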

OTHER TIPS

If you're interested in pursuing the naïve Bayes approach (there are other machine learning options, after all), then I suggest the following document, which follows the coverage of this subject in "Data Mining: Practical Machine Learning Tools and Techniques" by Witten and Frank:

http://www.coli.uni-sb.de/~crocker/Teaching/Connectionist/lecture10_4up.pdf
