Question

I want to classify 10 web pages using Weka. How do I convert web pages into Weka's ARFF file format? Do I need to convert all 10 pages into one ARFF file, or do I need a separate ARFF file for each web page, i.e. 10 ARFF files?

Solution

Assuming that you want to keep your HTML formatting, this is relatively easy. Put your HTML files in separate directories, one directory per class, then apply the TextDirectoryLoader converter, as explained in the "Text categorization with WEKA" tutorial.
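The one-directory-per-class layout can be prepared by hand, but a small script helps if the pages start out in a single folder. The following is only a minimal sketch in Python (not part of Weka): the function name `arrange_by_class`, the mapping of page paths to labels, and the directory names are all hypothetical.

```python
from pathlib import Path
import shutil


def arrange_by_class(labelled_pages, root):
    """Copy each page into root/<label>/ so that every subdirectory is one class.

    labelled_pages: dict mapping a page's file path to its class label
                    (hypothetical input; labels are assumed to be valid
                    directory names).
    Returns the sorted list of class directories that now exist under root.
    """
    root = Path(root)
    for page_path, label in labelled_pages.items():
        class_dir = root / label
        class_dir.mkdir(parents=True, exist_ok=True)
        # Keep the original file name; the HTML content is copied unchanged.
        shutil.copy(page_path, class_dir / Path(page_path).name)
    return sorted(d.name for d in root.iterdir() if d.is_dir())


# Example call (paths and labels are placeholders):
# arrange_by_class({"raw/page1.html": "sports", "raw/page2.html": "politics"}, "pages")
```

With the folders in place, the converter is usually run from the command line along the lines of `java -cp weka.jar weka.core.converters.TextDirectoryLoader -dir pages > pages.arff`, where the location of `weka.jar` and the `pages` directory depend on your setup.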

So you need one ARFF file, not ten. Assuming, for example, that you have two classes, this procedure gives you a single ARFF file with one instance per HTML file: the full text of each file becomes the value of a single string attribute, stored alongside the class (the directory name). You can then apply the StringToWordVector filter to transform the documents into term vectors and perform classification.
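To make the shape of that single ARFF file concrete, here is a rough Python sketch that writes a comparable file by hand. It is a stand-in for illustration, not Weka's TextDirectoryLoader: the attribute names `text` and `page_class`, the relation name, and the simplified escaping rules are assumptions, and class labels are assumed to contain no spaces or commas.

```python
from pathlib import Path


def arff_quote(value):
    """Quote a string for ARFF, escaping backslashes, quotes and line breaks.

    Simplified escaping; Weka's own quoting handles more characters.
    """
    escaped = (
        value.replace("\\", "\\\\")
        .replace("'", "\\'")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return "'" + escaped + "'"


def directory_to_arff(root, relation="webpages"):
    """Build one ARFF document: one data row per file, class = directory name."""
    root = Path(root)
    classes = sorted(d.name for d in root.iterdir() if d.is_dir())
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute page_class {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for label in classes:
        for page in sorted((root / label).iterdir()):
            if page.is_file():
                # The whole HTML file becomes a single string attribute value.
                text = page.read_text(encoding="utf-8", errors="replace")
                lines.append(f"{arff_quote(text)},{label}")
    return "\n".join(lines) + "\n"
```

For 10 pages split over two class directories, the result has 10 rows after `@data`, each holding one page's text and its class label, which is the input StringToWordVector expects.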

Licensed under: CC-BY-SA with attribution