Question

I'm building a system that looks through articles on various topics and picks out a description of each one, much like an encyclopaedia. At first, when I searched for "cat", I got many hits for articles like "CAT5", "CAT6", ".cat" and so on, although the top hit was still "Cat". I was using StandardAnalyzer for this. I was advised to use WhitespaceAnalyzer instead, which solved the original problem and made Lucene drop hits on articles like CAT6, but now the article "Cat" no longer appears in my list of hits at all. Why is this? Any suggestions, for example a different analyzer?

EDIT: The code for the search itself:

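import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.util.Version;

public static String searchAbstracts(String input, int hitsPerPage) throws ParseException, IOException {
    // The query text is analyzed with StandardAnalyzer, which lowercases terms.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
    Query q = new QueryParser(Version.LUCENE_41, "article", analyzer).parse(input);
    Directory index = new NIOFSDirectory(new File(INDEX_PATH));
    IndexReader reader = DirectoryReader.open(index);
    StringBuilder resultSet = new StringBuilder();
    try {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        System.out.println("Found " + hits.length + " articles.");

        // Collect the stored "desc" field of every hit, separated by spaces.
        for (int i = 0; i < hits.length; ++i) {
            Document d = searcher.doc(hits[i].doc);
            resultSet.append(d.get("desc")).append(" ");
            System.out.println((i + 1) + ". " + d.get("article") + " :: Words from abstract: " + d.get("desc"));
        }
    } finally {
        reader.close(); // release the index files
    }
    return resultSet.toString();
}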

Solution

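When you run a sentence such as "The quick Cat jumped over the lazy CAT6" through WhitespaceAnalyzer, it produces these tokens:
[The] [quick] [Cat] [jumped] [over] [the] [lazy] [CAT6]

As you can see, "Cat" is in the list with its original case, so you should be able to find it. Note that WhitespaceAnalyzer only splits on whitespace and does not lowercase, so the indexed term is "Cat", not "cat". How are you querying it? Which analyzer are you using at query time? If the query is analyzed differently from the index (for example, lowercased to "cat"), the term will not match.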
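
To make the case issue concrete, here is a minimal sketch in plain Java. It does not call Lucene at all; the class name AnalyzerCaseSketch and both helper methods are hypothetical and only mimic the behavior described above (splitting on whitespace with case preserved, and the same split followed by lowercasing). In Lucene itself, the second variant would roughly correspond to a custom analyzer that chains a WhitespaceTokenizer with a LowerCaseFilter, but treat that as a suggestion to verify against your version rather than a confirmed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AnalyzerCaseSketch {

    // Mimics whitespace-only analysis: split on whitespace, keep case as-is.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Same split, then lowercase every token (what a lowercase filter would add).
    static List<String> lowercasedTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "The quick Cat jumped over the lazy CAT6";

        List<String> caseSensitive = whitespaceTokens(sentence);
        System.out.println(caseSensitive);
        // A lowercased query term "cat" does not equal the indexed term "Cat".
        System.out.println("cat found (case preserved): " + caseSensitive.contains("cat"));
        System.out.println("Cat found (case preserved): " + caseSensitive.contains("Cat"));

        List<String> lowercased = lowercasedTokens(sentence);
        // With lowercasing on both sides, "cat" matches, while "cat6" stays a separate term.
        System.out.println("cat found (lowercased): " + lowercased.contains("cat"));
    }
}
```

The point of the sketch is that exact term matching is case-sensitive: whichever analysis is applied at index time has to be applied to the query as well, otherwise "cat" and "Cat" are different terms.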

Licensed under: CC-BY-SA with attribution