lucene query fuzzy in one field and exact in another

https://stackoverflow.com/questions/22131184

lucene

19-10-2022
Question

Question:

How to combine an exact match in one field AND a fuzzy search in another in lucene 4.5?

Problem:

I have indexed the NGA GeoNames gazetteer in a Lucene index. I need to run a fuzzy query on one field (the place name) but constrain the results to records that have a specific country code. I am not using Solr. I have done a lot of research and trial and error but have found no clear answer (it could be that I'm just slow). Here is a sample query I am running:

FULL_NAME_ND_RO:india AND CC1:in

I want a fuzzy search on india, but I want ONLY RECORDS THAT EXACTLY MATCH "in" (the country code)

Here is what I've tried:
1. Index every field as a TextField and boost the country code field using ^N. This still returns other country codes, and the boosted one does not always come first.
2. Index every field as text EXCEPT the country code, which I indexed as StringField. This way I get no results at all.

Here is the code that indexes the gazetteer:

public void index(File outputIndexDir, File gazateerInputData, GazType type) throws Exception {
    if (!outputIndexDir.isDirectory()) {
      throw new IllegalArgumentException("outputIndexDir must be a directory.");
    }

    String indexloc = outputIndexDir + type.toString();
    Directory index = new MMapDirectory(new File(indexloc));

    Analyzer a = new StandardAnalyzer(Version.LUCENE_45);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45, a);

    IndexWriter w = new IndexWriter(index, config);

    readFile(gazateerInputData, w, type);
    w.commit();
    w.close();

  }

  public void readFile(File gazateerInputData, IndexWriter w, GazType type) throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader(gazateerInputData));
    List<String> fields = new ArrayList<String>();
    int counter = 0;
    // int langCodeIndex = 0;
    System.out.println("reading gazateer data from file...........");
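    String line;
    // Read whole lines; readLine() returns null at end of file.
    // (Calling reader.read() in the loop condition would swallow the first character of each line.)
    while ((line = reader.readLine()) != null) {
      String[] values = line.split(type.getSeparator());
      if (counter == 0) {
        for (String columnName : values) {
          // Strip a leading byte-order mark, whether it was decoded as UTF-8 or as Latin-1.
          fields.add(columnName.replace("\uFEFF", "").replace("ï»¿", "").trim());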
        }

      } else {
        Document doc = new Document();
        for (int i = 0; i < fields.size() - 1; i++) {
          if (fields.get(i).equals("CC1")) {
            doc.add(new StringField(fields.get(i), values[i], Field.Store.YES));
          } else {
            doc.add(new TextField(fields.get(i), values[i], Field.Store.YES));
          }
        }

        w.addDocument(doc);

      }
      counter++;
      if (counter % 10000 == 0) {
        w.commit();
        System.out.println(counter + " .........committed to index..............");
      }

    }
    w.commit();
    System.out.println("Completed indexing gaz! index name is: " + type.toString());
  }

Here is the code for running the query:

QueryParser parser = new QueryParser(Version.LUCENE_45, luceneQueryString, geonamesAnalyzer);
  Query q = parser.parse(luceneQueryString);

  TopDocs search = geonamesSearcher.search(q, rowsReturned);

geonamesAnalyzer is a StandardAnalyzer, and luceneQueryString is like the query above.

Any advice would be great.

Solution

The simplest answer seems to be to run a fuzzy query with the appropriate query syntax, like:

 FULL_NAME_ND_RO:india~ AND CC1:in
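
For intuition about what the trailing ~ accepts: fuzzy matching in Lucene 4.x is based on edit distance, and as far as I know a bare ~ allows up to 2 edits. The sketch below is a plain-Java toy, not Lucene's implementation (Lucene uses an automaton and, by default, also counts a transposition as a single edit). The class and method names are made up for illustration.

```java
class EditDistanceDemo {
    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: would indexedTerm be within maxEdits of queryTerm?
    public static boolean fuzzyMatches(String queryTerm, String indexedTerm, int maxEdits) {
        return distance(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(fuzzyMatches("india", "indiana", 2));   // true: two insertions
        System.out.println(fuzzyMatches("india", "indonesia", 2)); // false: at least four edits
    }
}
```

This is why a fuzzy search on the place name alone pulls in names from other countries, and why the country code clause has to be a hard requirement rather than a boost.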

However, if you need to analyze each field differently, you can do that with a PerFieldAnalyzerWrapper.

Based on the comments below:

The default stopword set in StandardAnalyzer includes the word "in", so that search term is eliminated entirely from the query. The stopword set can be overridden via the appropriate StandardAnalyzer constructor:

new StandardAnalyzer(Version.LUCENE_45, new CharArraySet(Version.LUCENE_45, 0, true));
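
To make the stopword effect concrete, here is a plain-Java toy that lowercases, splits and filters text roughly the way an analyzer with a stop filter would. It does not use Lucene. The word list is what I believe the default English stop set contains, so treat it as an assumption and check it against your Lucene version. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopwordDemo {
    // Assumed copy of the default English stop words; verify before relying on it.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
        "the", "their", "then", "there", "these", "they", "this", "to", "was",
        "will", "with"));

    // Lowercase, split on anything that is not a letter or digit, drop stop words.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("india")); // [india]
        System.out.println(analyze("in"));    // []  (nothing left to search for)
        System.out.println(analyze("IT"));    // []  (other two-letter codes collide too)
    }
}
```

With an empty token list the CC1 clause contributes nothing to the parsed query, which matches the symptoms in the question. Note that "in" is not the only two-letter value that collides with a stop word.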

Since the CC1 field is a StringField (and is, thus, not analyzed at index time), it may make sense to be sure it is not analyzed at query time either. While the above fixes the stopword issue, you may yet run into case-related or tokenization issues, for instance. KeywordAnalyzer is generally appropriate for un-analyzed fields. A PerFieldAnalyzerWrapper can be passed in to the query parser to apply different analysis rules to different fields.

Something like:

Map<String,Analyzer> analyzerPerField = new HashMap<String,Analyzer>();
analyzerPerField.put("CC1", new KeywordAnalyzer());

PerFieldAnalyzerWrapper aWrapper =
  new PerFieldAnalyzerWrapper(geonamesAnalyzer, analyzerPerField);

QueryParser parser = new QueryParser(Version.LUCENE_45, defaultField, aWrapper);
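
The effect of the wrapper can be modelled in plain Java as a lookup from field name to analysis function with a fallback. This is only a sketch of the dispatch idea, not Lucene code: the two functions below are simplified stand-ins for KeywordAnalyzer and a StandardAnalyzer-like default (stop words omitted), and every name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class PerFieldDemo {
    // Stand-in for KeywordAnalyzer: the whole input becomes one token, unchanged.
    public static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Stand-in for the default analyzer: lowercase and split on whitespace.
    public static List<String> standardLike(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Field-specific overrides, mirroring the analyzerPerField map.
    public static Map<String, Function<String, List<String>>> demoAnalyzers() {
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        perField.put("CC1", PerFieldDemo::keyword);
        return perField;
    }

    // Use the field's own function if one is registered, else the default.
    public static List<String> analyze(Map<String, Function<String, List<String>>> perField,
                                       String field, String text) {
        Function<String, List<String>> analyzer = perField.get(field);
        return analyzer != null ? analyzer.apply(text) : standardLike(text);
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> perField = demoAnalyzers();
        System.out.println(analyze(perField, "CC1", "IN"));                    // [IN]
        System.out.println(analyze(perField, "FULL_NAME_ND_RO", "New Delhi")); // [new, delhi]
    }
}
```

Because the keyword path leaves the text untouched, the CC1 value in the query has to match the indexed value exactly, including case.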

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow