Stemming English words with Lucene

https://stackoverflow.com/questions/5391840

28-10-2019
Question

I'm processing some English texts in a Java application and I need to stem them. For example, from the text "amenities/amenity" I need to get "amenit".

The function looks like this:

String stemTerm(String term){
   ...
}

I found the Lucene Analyzer, but it looks far too complicated for what I need: http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/porterstemfilter.html

Is there a way to use it to stem words without building an Analyzer? I don't understand the whole Analyzer business...

EDIT: I actually need stemming + lemmatization. Can Lucene do this?

Solution

import org.apache.lucene.analysis.PorterStemmer;
...
String stemTerm (String term) {
    PorterStemmer stemmer = new PorterStemmer();
    return stemmer.stem(term);
}

See here for more details. If stemming is all you want to do, then you should use this instead of Lucene.

Edit: You should lowercase term before passing it to stem().
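To illustrate why lowercasing matters, here is a minimal sketch. It does not use Lucene: the class name LowercaseBeforeStem and the toy "stemmer" lambda are hypothetical stand-ins (the lambda just strips a trailing "s" and is not the Porter algorithm). The only point is that a case-sensitive stemmer will miss suffixes in uppercase input unless the term is normalized first.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

class LowercaseBeforeStem {
    // Trim and lowercase the term, then hand it to whatever stemmer is supplied.
    // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
    static String stemTerm(String term, UnaryOperator<String> stemmer) {
        String normalized = term.trim().toLowerCase(Locale.ROOT);
        return stemmer.apply(normalized);
    }

    public static void main(String[] args) {
        // Hypothetical placeholder stemmer: strips one trailing "s".
        // This is NOT the Porter algorithm; swap in a real stemmer here.
        UnaryOperator<String> toyStemmer =
                s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;

        System.out.println(stemTerm("BANKS", toyStemmer)); // prints "bank"
    }
}
```

Without the toLowerCase call, the toy stemmer above would return "BANKS" unchanged, because the suffix check is case-sensitive.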

Other suggestions

SnowballAnalyzer is deprecated; you can use the Lucene Porter Stemmer instead:

 PorterStemmer stem = new PorterStemmer();
 stem.setCurrent(word);
 stem.stem();
 String result = stem.getCurrent();

Hope this helps!

Why aren't you using the EnglishAnalyzer? It's simple to use and I think it would solve your problem:

EnglishAnalyzer en_an = new EnglishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "your_field", en_an);
String str = "amenities";
System.out.println("result: " + parser.parse(str)); //amenit

Hope it helps!

The previous example applies stemming to a search query, so if you are interested in stemming a full text you can try the following:

import java.io.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.snowball.*;
import org.apache.lucene.util.*;
...
public class Stemmer{
    public static String Stem(String text, String language){
        StringBuffer result = new StringBuffer();
        if (text!=null && text.trim().length()>0){
            StringReader tReader = new StringReader(text);
            Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_35,language);
            TokenStream tStream = analyzer.tokenStream("contents", tReader);
            TermAttribute term = tStream.addAttribute(TermAttribute.class);

            try {
                while (tStream.incrementToken()){
                    result.append(term.term());
                    result.append(" ");
                }
            } catch (IOException ioe){
                System.out.println("Error: "+ioe.getMessage());
            }
        }

        // If, for some reason, the stemming did not happen, return the original text
        if (result.length()==0)
            result.append(text);
        return result.toString().trim();
    }

    public static void main (String[] args){
        // Print the stemmed text so the example produces visible output
        System.out.println(Stemmer.Stem("Michele Bachmann amenities pressed her allegations that the former head of her Iowa presidential bid was bribed by the campaign of rival Ron Paul to endorse him, even as one of her own aides denied the charge.", "English"));
    }
}

The TermAttribute class has been deprecated and will no longer be supported in Lucene 4, but the documentation is not clear about what to use in its place.

Also, in the first example PorterStemmer is not accessible as a class (it is hidden), so you cannot use it directly.

Hope this helps.

Here is how you can use the Snowball stemmer in Java:

import org.tartarus.snowball.ext.EnglishStemmer;

EnglishStemmer english = new EnglishStemmer();
String[] words = tokenizer("bank banker banking");
for(int i = 0; i < words.length; i++){
        english.setCurrent(words[i]);
        english.stem();
        System.out.println(english.getCurrent());
}
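The snippet above calls a tokenizer(...) helper that is not shown. A minimal sketch of what it might look like, assuming plain whitespace splitting is good enough (the class name TokenizerSketch is made up for illustration, and the original author's helper may well have done more, such as stripping punctuation):

```java
import java.util.Arrays;

class TokenizerSketch {
    // Assumed behaviour: split on runs of whitespace, ignoring leading/trailing blanks.
    // Returns an empty array for null or blank input.
    static String[] tokenizer(String text) {
        String trimmed = (text == null) ? "" : text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenizer("bank banker banking")));
        // prints "[bank, banker, banking]"
    }
}
```

Each element of the returned array can then be passed to setCurrent() in the loop above.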

LingPipe provides a number of tokenizers. They can be used for stemming and stop-word removal. It is a simple and effective way to do stemming.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow