Question

In my webMethods application I need to implement search functionality, which I have done with Lucene. However, the search retrieves no results when I search for a file whose title ends in something other than a letter, e.g. doc1.txt or new$.txt.
In the code below, when I print queryCmbd for such a search, it prints Search Results>>>>>>>title:"doc1 txt" (contents:doc1 contents:txt). When I search for a string like doc.txt, the result is Search Results>>>>>>>title:"doc.txt" contents:doc.txt. What should be done in order to parse these kinds of strings (like doc1.txt and new$.txt)?

 public java.util.ArrayList<DocNames> searchIndex(String querystr,
                String path, StandardAnalyzer analyzer) {
            String FIELD_CONTENTS = "contents";
            String FIELD_TITLE = "title";
            String queryFinal = querystr.replaceAll(" ", " AND ");
            String queryStringCmbd = FIELD_TITLE + ":\"" + queryFinal + "\" OR "
                    + queryFinal;


            try {

                FSDirectory directory = FSDirectory.open(new File(path));

                Query q = new QueryParser(Version.LUCENE_36, FIELD_CONTENTS,
                        analyzer).parse(querystr);

                Query queryCmbd = new QueryParser(Version.LUCENE_36,
                        FIELD_CONTENTS, analyzer).parse(queryStringCmbd);

                int hitsPerPage = 10;
                IndexReader indexReader = IndexReader.open(directory);
                IndexSearcher indexSearcher = new IndexSearcher(indexReader);

                TopScoreDocCollector collector = TopScoreDocCollector.create(
                        hitsPerPage, true);
                indexSearcher.search(queryCmbd, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;

                System.out.println("Search Results>>>>>>>>>>>>" + queryCmbd);

                docNames = new ArrayList<DocNames>();
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = indexSearcher.doc(docId);
                    DocNames doc = new DocNames();
                    doc.setIndex(i + 1);
                    doc.setDocName(d.get("title"));
                    doc.setDocPath(d.get("path"));
                    if (!(d.get("path").contains("indexDirectory"))) {
                        docNames.add(doc);
                    }
                }

                indexReader.flush();
                indexReader.close();
                indexSearcher.close();
                return docNames;
            } catch (CorruptIndexException e) {
                closeIndex(analyzer);
                e.printStackTrace();
                return null;
            } catch (IOException e) {
                closeIndex(analyzer);
                e.printStackTrace();
                return null;
            } catch (ParseException e) {
                closeIndex(analyzer);
                e.printStackTrace();
                return null;
            }
        }
Solution

Your problem comes from the fact that you're using StandardAnalyzer. Its javadoc says it uses StandardTokenizer for token splitting, which means text like doc1.txt will be split into the tokens doc1 and txt.

If you want to match the entire text, you need to use KeywordAnalyzer, both for indexing and searching. The code below shows the difference: with StandardAnalyzer the tokens are {"doc1", "txt"}, while with KeywordAnalyzer the only token is doc1.txt.

    import java.io.StringReader;

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    String foo = "foo:doc1.txt";
    // StandardAnalyzer splits on the dot: prints "doc1" then "txt"
    StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_36);
    TokenStream tokenStream = sa.tokenStream("foo", new StringReader(foo));
    while (tokenStream.incrementToken()) {
        System.out.println(tokenStream.getAttribute(TermAttribute.class).term());
    }

    System.out.println("-------------");

    // KeywordAnalyzer keeps the whole input as one token: prints "foo:doc1.txt"
    KeywordAnalyzer ka = new KeywordAnalyzer();
    TokenStream tokenStream2 = ka.tokenStream("foo", new StringReader(foo));
    while (tokenStream2.incrementToken()) {
        System.out.println(tokenStream2.getAttribute(TermAttribute.class).term());
    }
Licensed under: CC-BY-SA with attribution