Question

This problem is really driving me crazy.

To answer what most people will ask first: yes, I added snowball.jar to the classpath.

I have a simple main class that is supposed to stem the word "going" to "go":

import weka.core.stemmers.SnowballStemmer;

public class StemmerTest {
    public static void main(String[] args) {
        SnowballStemmer stemmer = new SnowballStemmer();
        stemmer.setStemmer("english");
        System.out.println(stemmer.stem("going"));
    }
}

When I run it in Eclipse, it works and I get the following output:

Refreshing GOE props...
---Registering Weka Editors---
Trying to add database driver (JDBC): RmiJdbc.RJDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): jdbc.idbDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.gjt.mm.mysql.Driver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): com.mckoi.JDBCDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.hsqldb.jdbcDriver - Warning, not in CLASSPATH?
[KnowledgeFlow] Loading properties and plugins...
[KnowledgeFlow] Initializing KF...
go

However, when I export it from Eclipse as a runnable jar ("stem.jar") and execute it in the terminal with "java -jar stem.jar", it doesn't work and I get the following output:

Refreshing GOE props...
[KnowledgeFlow] Loading properties and plugins...
[KnowledgeFlow] Initializing KF...
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
going

I have no idea why snowball.jar is not recognized in the exported jar, even though both weka.jar and snowball.jar are included in it. Here is the stem.jar file structure:

stem.jar
       |
       |---META-INF
       |---org
       |---StemmerTest.class
       |---snowball.jar
       |---weka.jar
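
A quick way to confirm this layout from code is to list the entries of the exported jar and see whether the Snowball classes sit at the top level or only inside a nested snowball.jar. The following is a minimal sketch using only the JDK; the class name JarLayoutCheck is made up for illustration, and the package path org/tartarus/snowball/ is an assumption about where the Snowball classes live.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Weka or Eclipse.
class JarLayoutCheck {

    /** Returns the names of all entries in the jar that start with the given prefix. */
    static List<String> entriesWithPrefix(String jarPath, String prefix) throws IOException {
        List<String> matches = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.startsWith(prefix)) {
                    matches.add(name);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        String jarPath = args.length > 0 ? args[0] : "stem.jar";
        // Assumption: the Snowball classes live under org/tartarus/snowball/.
        List<String> flattened = entriesWithPrefix(jarPath, "org/tartarus/snowball/");
        List<String> nested = entriesWithPrefix(jarPath, "snowball.jar");
        System.out.println("Snowball entries at top level: " + flattened.size());
        System.out.println("Nested snowball.jar present: " + !nested.isEmpty());
    }
}
```

For the structure shown above, I would expect the nested snowball.jar to be reported and no top-level Snowball entries, but I have not run this against the actual stem.jar.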

I would appreciate any help with this problem.

EDIT 1: Generated ANT Script:

<project default="create_run_jar" name="Create Runnable Jar for Project StemmerTest with Jar-in-Jar Loader">
<!--this file was created by Eclipse Runnable JAR Export Wizard-->
<!--ANT 1.7 is required                                        -->
<target name="create_run_jar">
    <jar destfile="stem.jar">
        <manifest>
            <attribute name="Main-Class" value="org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader"/>
            <attribute name="Rsrc-Main-Class" value="StemmerTest"/>
            <attribute name="Class-Path" value="."/>
            <attribute name="Rsrc-Class-Path" value="./ snowball-2012.jar weka.jar snowball.jar"/>
        </manifest>
        <zipfileset src="jar-in-jar-loader.zip"/>
        <zipfileset dir="resources/lib" includes="snowball-2012.jar"/>
        <fileset dir="bin"/>
        <zipfileset dir="." includes="weka.jar"/>
        <zipfileset dir="." includes="snowball.jar"/>
    </jar>
</target>

EDIT 2:

Here is the content of MANIFEST.MF as requested.

Manifest-Version: 1.0
Ant-Version: Apache Ant 1.7.1
Created-By: 23.25-b01 (Oracle Corporation)
Main-Class: org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader
Rsrc-Main-Class: StemmerTest
Rsrc-Class-Path: ./ weka.jar snowball.jar
Class-Path: .

Thanks in Advance, TeFa


Solution

Although the cause is still not clear to me, I managed to solve this annoying problem (after ~10 hours -.-) by doing the following:

  • Use "zipgroupfileset" instead of "fileset" for "snowball.jar", so that its contents are flattened into the generated jar file.

  • Exclude "snowball.jar" from the classpath (since its contents are already included in the generated jar file).

For some UNKNOWN reason, the Snowball wrapper in weka.jar couldn't find snowball.jar until it was flattened (extracted).

Here is the ant script that works for me:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project default="jar">
    <path id="dep.runtime">
        <fileset dir="./libs">
            <include name="**/*.jar" />
            <exclude name="**/snowball.jar"/>
        </fileset>
    </path>

    <manifestclasspath property="manifest_cp" jarfile="stem.jar">
        <classpath refid="dep.runtime" />
    </manifestclasspath>

    <target name="jar">
        <jar destfile="stem.jar">
            <manifest>
                <attribute name="Main-Class" value="StemmerTest"/>
                <attribute name="Class-Path" value="${manifest_cp}"/>
            </manifest>
            <zipgroupfileset dir="./libs" includes="snowball.jar"/>
            <fileset dir="bin"/>
        </jar>
    </target>
</project>
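
One possible explanation, which I have not verified against the Weka source, is that the wrapper discovers the available stemmers by scanning the entries listed in the java.class.path system property. With "java -jar stem.jar" that property names only stem.jar, so classes inside a nested snowball.jar would not be seen by such a scan, while flattened classes would. The sketch below is a simplified illustration of that kind of scan, not Weka's code; the class name ClasspathScan and the package path org/tartarus/snowball/ext/ are assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical illustration of a classpath scan; not taken from Weka.
class ClasspathScan {

    /**
     * Lists .class entries under packagePath that are stored directly in the
     * jar files named in classPath. Directories are skipped for brevity, and
     * jars nested inside other jars are never opened.
     */
    static List<String> findClasses(String classPath, String packagePath) throws IOException {
        List<String> found = new ArrayList<>();
        for (String entry : classPath.split(File.pathSeparator)) {
            File file = new File(entry);
            if (!file.isFile() || !entry.endsWith(".jar")) {
                continue;
            }
            try (JarFile jar = new JarFile(file)) {
                Enumeration<JarEntry> entries = jar.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.startsWith(packagePath) && name.endsWith(".class")) {
                        found.add(name);
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String classPath = System.getProperty("java.class.path");
        System.out.println("java.class.path = " + classPath);
        // Assumption: the stemmer classes live under org/tartarus/snowball/ext/.
        System.out.println(findClasses(classPath, "org/tartarus/snowball/ext/"));
    }
}
```

Running the main method from inside both the Eclipse-exported jar and the Ant-built jar should show whether the stemmer classes are visible to a scan of this kind in each case.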

Hope this helps if someone is using snowball stemmer.

OTHER TIPS

I got it working after an hour of tests, since there is nothing on this in the wiki. The solution goes like this:

SnowballStemmer stemmer = new SnowballStemmer();
stemmer.setStemmer("English");
StringToWordVector STWfilter = new StringToWordVector(1000);
STWfilter.setUseStoplist(true);
STWfilter.setIDFTransform(true);
STWfilter.setTFTransform(true);
STWfilter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
STWfilter.setOutputWordCounts(true);
STWfilter.setStemmer(stemmer);
STWfilter.setInputFormat(train); // train: the training Instances (not shown here)

I posted the whole example to save you the hour I spent getting this right.

I had the same problem with Snowball when using multithreading. I solved it like this:

SnowballStemmer st = new SnowballStemmer();
do {
    // wait until the German stemmer is initialized
} while (!st.stemmerTipText().contains("german"));
st.setStemmer("german");
filter.setStemmer(st);

The error message "Stemmer 'porter' unknown!" will still appear, but the requested stemmer (e.g. the German one) will be set correctly.
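
The loop above spins forever if the stemmer never shows up. A bounded variant that sleeps between checks and gives up after a timeout might look like the sketch below. BoundedWait and waitUntil are hypothetical names, and the condition is passed in by the caller, so nothing here depends on Weka.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for polling a condition with a timeout.
class BoundedWait {

    /**
     * Polls the condition every pollMillis milliseconds until it returns true
     * or timeoutMillis has elapsed. Returns true if the condition was met.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the condition became true
            }
            Thread.sleep(pollMillis);
        }
        return true;
    }
}
```

With the snippet above, the call could be waitUntil(() -> st.stemmerTipText().contains("german"), 5000, 50), followed by a check of the return value before calling setStemmer. The 5000 and 50 millisecond values are arbitrary examples.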

I followed this method and it worked. My IDE is NetBeans. I downloaded the jar from here; it is the second option under the title "Snowball stemmers". I added it to my classpath and used the following code to set the stemmer on the filter.

SnowballStemmer stemmer = new SnowballStemmer();
stemmer.setStemmer("english");
StringToWordVector filter = new StringToWordVector();
filter.setStemmer(stemmer);
Licensed under: CC-BY-SA with attribution