Question

I'm currently working on a custom analyzer for a Mahout clustering project. Since Mahout 0.8 updated Lucene to 4.3, I'm having trouble generating the tokenized-document file (a SequenceFile) from the book's now-outdated sample. The following code is my revision of the example code from the book Mahout in Action, but it gives me an IllegalStateException.

public class MyAnalyzer extends Analyzer {

private final Pattern alphabets = Pattern.compile("[a-z]+");
Version version = Version.LUCENE_43;

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(version, reader);
    TokenStream filter = new StandardFilter(version, source);

    filter = new LowerCaseFilter(version, filter);
    filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);

    CharTermAttribute termAtt = (CharTermAttribute)filter.addAttribute(CharTermAttribute.class);
    StringBuilder buf = new StringBuilder();

    try {

        filter.reset();
        while(filter.incrementToken()){
            if(termAtt.length()>10){
                continue;
            }
            String word = new String(termAtt.buffer(), 0, termAtt.length());
            Matcher matcher = alphabets.matcher(word);
            if(matcher.matches()){
                buf.append(word).append(" ");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    source = new WhitespaceTokenizer(version, new StringReader(buf.toString()));

    return new TokenStreamComponents(source, filter);

}

}


Solution

Not quite sure why you get an IllegalStateException, but there are some likely possibilities. Typically your analyzer builds its filters on top of the tokenizer. You do that, but then you create another tokenizer and pass that one back instead, so the filter you return has no direct relation to the tokenizer you return. Also, the filter you have constructed is already at its end by the time it's passed back, so you might try resetting it, I suppose.
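
For reference, here is a minimal sketch of how the caller normally drives a Lucene 4.3 TokenStream once createComponents only builds the chain; the class name AnalyzerSmokeTest and the sample input are purely illustrative:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSmokeTest {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new MyAnalyzer();
        TokenStream stream = analyzer.tokenStream("text", new StringReader("Some sample text 123"));
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        // The caller drives the stream: reset, iterate, end, close.
        // Consuming the stream inside createComponents leaves it already exhausted when it is returned.
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(termAtt.toString());
        }
        stream.end();
        stream.close();
    }
}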

The main problem, though, is that createComponents isn't really a great place to implement parsing logic. It's where you set up the Tokenizer and the stack of Filters that will do that work. It would make more sense to implement your custom filtering logic in a Filter, extending TokenFilter (or TokenStream, AttributeSource, or some such).
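
For example, the per-token checks from your code could move into a small TokenFilter along these lines (AlphaOnlyFilter is an illustrative name, not an existing Lucene class); you would then add it to the chain in createComponents with filter = new AlphaOnlyFilter(filter):

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Keeps only tokens of ten characters or fewer that consist entirely of lower-case letters
public final class AlphaOnlyFilter extends TokenFilter {

    private static final Pattern ALPHABETS = Pattern.compile("[a-z]+");
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public AlphaOnlyFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Pull tokens from the upstream stream and skip any that fail the checks
        while (input.incrementToken()) {
            if (termAtt.length() <= 10 && ALPHABETS.matcher(termAtt).matches()) {
                return true;
            }
        }
        return false;
    }
}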

I think what you are looking for has already been implemented, though, in PatternReplaceCharFilter. Note that it is a CharFilter, so it wraps the Reader and runs before the tokenizer rather than sitting in the token-filter chain:

private final Pattern nonAlpha = Pattern.compile("[^a-zA-Z]+");

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Strip runs of non-alphabetic characters from the raw input before tokenizing
    Tokenizer source = new StandardTokenizer(version, new PatternReplaceCharFilter(nonAlpha, " ", reader));
    TokenStream filter = new StandardFilter(version, source);
    filter = new LowerCaseFilter(version, filter);
    filter = new StopFilter(version, filter, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, filter);
}

or perhaps something still simpler, like this, would serve:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // LowerCaseTokenizer already splits on non-letter characters and lower-cases the tokens
    Tokenizer source = new LowerCaseTokenizer(version, reader);
    TokenStream filter = new StopFilter(version, source, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(source, filter);
}
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow