Problem

Are there any reliable, deployed approaches, algorithms, or tools for tagging a website's type by parsing some of its pages?

For example: forums, blogs, press-release sites, news sites, e-commerce, etc.

I am looking for well-defined characteristics (static rules) from which this can be determined. If those don't exist, I hope a machine learning model may help.
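To make the "static rules" idea concrete, here is a minimal sketch of what such heuristics might look like. The specific fingerprints (a WordPress generator meta tag, "add to cart" text, phpBB URL fragments) are illustrative assumptions, not an exhaustive or authoritative rule set:

```python
def guess_site_type(html: str) -> str:
    """Classify a page by simple, hand-written markup fingerprints.

    These signals are hypothetical examples; real rules would need
    many more patterns and careful tuning.
    """
    html = html.lower()
    if '<meta name="generator" content="wordpress' in html:
        return "blog"          # WordPress generator tag is a common blog signal
    if "add to cart" in html or "checkout" in html:
        return "ecommerce"     # shopping-cart vocabulary
    if "viewtopic" in html or "phpbb" in html:
        return "forum"         # phpBB-style forum URLs/markup
    if "press release" in html:
        return "press-release"
    return "unknown"

print(guess_site_type('<meta name="generator" content="WordPress 6.4">'))  # blog
print(guess_site_type('<a href="/cart">Add to cart</a>'))                  # ecommerce
```

Rules like these are cheap and interpretable, but brittle: sites that don't advertise their platform fall through to "unknown", which is one motivation for the learning-based approach below.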

Suggestions or ideas?


Solution

If you approach this from a machine learning standpoint, a Naive Bayes classifier probably has the best work-to-payoff ratio. A version of it is used in Winnow to categorize news articles.

You will need a collection of pages, each tagged with its proper category. Then you extract words or other relevant elements from each page and use them as features.
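As a sketch of this pipeline, here is a small multinomial Naive Bayes classifier in plain Python, using word counts as features with add-one smoothing. The tiny training set and its category labels are made-up placeholders standing in for real tagged pages:

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical training data: (page text, category) pairs.
# In practice each text would be the extracted content of a crawled page.
TRAINING = [
    ("add to cart checkout free shipping product reviews", "ecommerce"),
    ("reply quote thread posted by member join date posts", "forum"),
    ("breaking news reporter said today according to officials", "news"),
    ("posted in category tags leave a comment archives", "blog"),
]

def tokenize(text):
    # Lowercase word features extracted from the page text.
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.cat_counts = Counter()              # category -> number of pages
        self.vocab = set()

    def train(self, examples):
        for text, cat in examples:
            self.cat_counts[cat] += 1
            for w in tokenize(text):
                self.word_counts[cat][w] += 1
                self.vocab.add(w)

    def classify(self, text):
        total = sum(self.cat_counts.values())
        best, best_lp = None, float("-inf")
        for cat in self.cat_counts:
            # log P(cat) + sum over words of log P(word | cat),
            # with add-one (Laplace) smoothing to avoid zero probabilities.
            lp = math.log(self.cat_counts[cat] / total)
            denom = sum(self.word_counts[cat].values()) + len(self.vocab)
            for w in tokenize(text):
                lp += math.log((self.word_counts[cat][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = cat, lp
        return best

nb = NaiveBayes()
nb.train(TRAINING)
print(nb.classify("checkout your cart and read product reviews"))  # ecommerce
```

With a realistic corpus you would use many pages per category and likely a library implementation (e.g. scikit-learn's `MultinomialNB`), but the underlying computation is exactly this word-count model.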

Dr. Dobb's has an article on implementing Naive Bayes.

Other tips

If you're interested in pursuing the naïve Bayes approach (there are other machine learning options, after all), then I suggest the following document, which follows the coverage of this subject in "Data Mining: Practical Machine Learning Tools and Techniques" by Witten and Frank:

http://www.coli.uni-sb.de/~crocker/Teaching/Connectionist/lecture10_4up.pdf

License: CC-BY-SA with attribution
Not affiliated with StackOverflow