Question

I want to classify 10 web pages using Weka. How do I convert web pages into Weka's ARFF file format? Do I need to convert all 10 pages into one ARFF file, or do I need a separate ARFF file for each web page, i.e. 10 ARFF files?

Solution

Assuming that you want to keep the HTML markup as part of the text, this is relatively easy. Put your HTML files in separate directories, one directory per class, then apply the TextDirectoryLoader converter, as explained in the "Text categorization with WEKA" tutorial.
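The directory layout can be prepared by hand, but a small script helps when there are many pages. The following is a minimal Python sketch, not part of Weka: the function name `arrange_by_class` and the `labels` mapping (file name to class label) are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def arrange_by_class(labels, source_dir, target_dir):
    """Copy each HTML file into target_dir/<class label>/.

    labels is a hypothetical mapping of file name -> class label.
    Returns the sorted list of class directories that now exist.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    for file_name, label in labels.items():
        class_dir = target_dir / label
        # One directory per class; the directory name becomes the class value.
        class_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(source_dir / file_name, class_dir / file_name)
    return sorted(p.name for p in target_dir.iterdir() if p.is_dir())
```

With the files arranged like this, the loader can be pointed at the top-level directory, for example with `java -cp weka.jar weka.core.converters.TextDirectoryLoader -dir pages > pages.arff`, assuming `weka.jar` is in the current directory and `pages` is the directory that holds the class subdirectories.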

With, say, two classes, this procedure gives you a single ARFF file (not one file per page) with one instance per HTML file: the text of each file becomes the value of a string attribute, alongside the class (the directory name). You can then apply the StringToWordVector filter to transform the documents into term vectors and perform classification.
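To make the shape of that single ARFF file concrete, here is a rough Python sketch that builds a comparable file by hand. It illustrates the structure only and is not what TextDirectoryLoader runs internally; the function names, the relation name `webpages`, and the attribute names `text` and `class` are assumptions, and class directory names are assumed to contain no spaces or commas.

```python
from pathlib import Path


def arff_quote(text):
    """Wrap a string value in single quotes, escaping special characters."""
    replacements = [
        ("\\", "\\\\"),  # backslash must be escaped first
        ("'", "\\'"),
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\t", "\\t"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return "'" + text + "'"


def directory_to_arff(root_dir, relation="webpages"):
    """Build one ARFF document: one instance per file, class = directory name."""
    root = Path(root_dir)
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for label in classes:
        for page in sorted((root / label).iterdir()):
            if page.is_file():
                text = page.read_text(encoding="utf-8", errors="replace")
                lines.append(f"{arff_quote(text)},{label}")
    return "\n".join(lines) + "\n"
```

Each data row holds the full page text followed by its class label, so 10 pages produce 10 rows in one file.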

Licensed under: CC-BY-SA with attribution