Lucene wrong match

https://stackoverflow.com/questions/23484652

java
lucene

16-07-2023

Question

I have a CSV file:

    id|name
    1|PC
    2|Activation
    3|USB

    public class TESTResult
    {
        private Long id;
        private String name;
        private Float score;
        // with setters & getters
    }

    public class TEST
    {
        private Long id;
        private String name;
        // with setters & getters
    }

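    import java.util.ArrayList;
    import java.util.List;

    import com.csvreader.CsvReader;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.MultiTermQuery.RewriteMethod;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class JobTESTTagger {
        private static Version VERSION;
        private static CharArraySet STOPWORDS;
        private static RewriteMethod REWRITEMETHOD;
        private static Float MINSCORE = 0.0001F;

        static {
            BooleanQuery.setMaxClauseCount(100000);
            VERSION = Version.LUCENE_44;
            STOPWORDS = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
            REWRITEMETHOD = MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE;
        }

        public static ArrayList<TESTResult> searchText(String text, String keyId,
                List<TEST> TESTs) {
            ArrayList<TESTResult> results = new ArrayList<TESTResult>();
            MemoryIndex index = new MemoryIndex();
            // The same analyzer is used for the indexed text and for the queries.
            EnglishAnalyzer englishAnalyzer = new EnglishAnalyzer(VERSION, STOPWORDS);
            QueryParser parser = new QueryParser(VERSION, "text", englishAnalyzer);

            parser.setMultiTermRewriteMethod(REWRITEMETHOD);
            index.addField("text", text, englishAnalyzer);
            for (int i = 0; i < TESTs.size(); i++) {
                TEST test = TESTs.get(i);
                // Check the raw name before quoting; the quoted string is never empty.
                String name = test.getName();
                if (name == null || name.trim().isEmpty())
                    continue;

                // Quote the name so it is parsed as a phrase, with line breaks flattened.
                String criteria = "\"" + name.trim().replaceAll("[\r\n]", " ") + "\"";

                try {
                    Query query = parser.parse(criteria);
                    float score = index.search(query);
                    if (score > MINSCORE) {
                        TESTResult result = new TESTResult(test.getId(), test.getName(), score);
                        results.add(result);
                    }
                } catch (ParseException e) {
                    System.out.println("Could not parse article.");
                }
            }
            return results;
        }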

        public static void main(String[] args) throws Exception {
            // CsvReader is com.csvreader.CsvReader from the JavaCSV library.
            CsvReader reader = new CsvReader("C:\\a.csv");
            reader.setDelimiter('|');
            reader.readHeaders();

            List<TEST> result = new ArrayList<TEST>();
            while (reader.readRecord()) {
                Long id = Long.valueOf(reader.get("id").trim());
                String name = reader.get("name").trim();
                TEST concept = new TEST(id, name);
                result.add(concept);
            }
            reader.close();

            String text = "These activities are good. I have a good PC in my house.";
            // keyId is not used inside searchText, so null is passed here.
            ArrayList<TESTResult> testresults = searchText(text, null, result);
        }
    }

The word 'activities' in the text is being matched to "Activation". How is that possible? Can anybody tell me how Lucene matches the words?

Thanks R

Solution

EnglishAnalyzer, along with most language-specific analyzers, uses a stemmer. This means that it reduces terms to a stem (or root) of the term, in order to attempt to match more loosely. Mostly this works well, removing suffixes and matching up derived words to a common root. So when I search for "fish", I also find "fished", "fishing" and "fishes".

In this case, though, "activities" and "activation" both reduce to the stem "activ", resulting in the match you are seeing. Another example: "organ", "organic" and "organize" all share the stem "organ".
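To make the collision concrete, here is a minimal sketch of suffix stripping. It is a toy and not the Porter-style stemmer that EnglishAnalyzer actually uses; the class name `ToyStemmer` and its suffix list are assumptions made up for this illustration. It only shows how two different words can end up as the same indexed term.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy suffix stripper for illustration only; NOT Lucene's stemmer. */
final class ToyStemmer {
    // Hypothetical suffix list, longest first, chosen just for this example.
    private static final List<String> SUFFIXES =
            Arrays.asList("ation", "ities", "ing", "es", "ed", "s");

    /** Lowercases the word and strips the first matching suffix, if any. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words survive intact.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : Arrays.asList("activities", "Activation", "fishing", "fishes")) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

With this toy rule set, "activities" and "Activation" both come out as "activ", and "fishing" and "fishes" both come out as "fish". A real stemmer applies many more rules, but the effect on matching is the same: the index and the query are compared by stem, not by the original word.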

You can stem or not; neither approach is perfect. If you don't stem, you'll miss relevant results. If you do, you'll get some odd, irrelevant ones.

To deal with specific problematic cases, you can define a stemmer exclusion set in EnglishAnalyzer to prevent stemming just on those specific problematic terms. In this case, I would think of "activation" as the probable term to prevent stemming on, though you could go either way. So I could do something like:

    CharArraySet stemExclusionSet = new CharArraySet(VERSION, 1, true);
    stemExclusionSet.add("activation");
    EnglishAnalyzer englishAnalyzer = new EnglishAnalyzer(VERSION, STOPWORDS, stemExclusionSet);
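The effect of the exclusion set on matching can be sketched without Lucene. The code below is an assumption-laden model, not Lucene code: `StemExclusionSketch`, `toyStem`, `analyze` and `matches` are hypothetical names, and the stemmer strips only two suffixes. It shows that a protected term is kept as-is on both the text side and the query side, so "activation" no longer collapses to the same term as "activities".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative model of a stem exclusion set; not Lucene code. */
final class StemExclusionSketch {
    // Hypothetical stand-in for a real stemmer: strips two suffixes only.
    static String toyStem(String w) {
        if (w.length() > 5 && (w.endsWith("ation") || w.endsWith("ities"))) {
            return w.substring(0, w.length() - 5);
        }
        return w;
    }

    /** Lowercases each token and stems it unless it is in the exclusion set. */
    static Set<String> analyze(String text, Set<String> exclusions) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(exclusions.contains(token) ? token : toyStem(token));
        }
        return terms;
    }

    /** True when every analyzed query term also appears among the text's terms. */
    static boolean matches(String text, String query, Set<String> exclusions) {
        return analyze(text, exclusions).containsAll(analyze(query, exclusions));
    }

    public static void main(String[] args) {
        String text = "These activities are good. I have a good PC in my house.";
        Set<String> none = new HashSet<>();
        Set<String> protect = new HashSet<>(Arrays.asList("activation"));
        System.out.println(matches(text, "Activation", none));    // true
        System.out.println(matches(text, "Activation", protect)); // false
        System.out.println(matches(text, "PC", protect));         // true
    }
}
```

Note that the same exclusion set is applied to the text and to the query. In the question's code this happens naturally, because a single EnglishAnalyzer instance is passed to both `index.addField` and the `QueryParser`.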

License: CC-BY-SA with attribution

Not affiliated with StackOverflow