Suppose the index contains documents with phrases like "Facebook acquires WhatsApp for $19 B". I want to search with a pattern such as "Facebook[\s\w+]*Whatsapp" and get back every phrase in which Facebook and WhatsApp are separated by a single word (acquires, buys, etc.).

How can I do this in Lucene? Is it efficient enough to run thousands of such queries against a 50GB corpus?

P.S. So far I've experimented with regex search using RegexpQuery, but I can't get it to work for a multi-word phrase. Here's a snippet from the code:

Term term = new Term("text", "Facebook[\\s\\w+]*Whatsapp");
Term t = new Term(userQuery);
Query query = new RegexpQuery(term);

Solution

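You could use the proximity query "Facebook Whatsapp"~1, which matches all documents where the distance between these two words is less than or equal to 1. RegexpQuery is matched against individual indexed terms, so a pattern cannot span several words of a tokenized field.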

For more information, see http://wiki.apache.org/lucene-java/LuceneFAQ#Is_there_a_way_to_use_a_proximity_operator_.28like_near_or_within.29_with_Lucene.3F and http://searchhub.org//2009/07/18/the-spanquery/
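To make the positional check concrete, here is a minimal plain-Java sketch of what a proximity match with a distance of 1 amounts to for two terms. It is an illustration only, not Lucene's implementation: the class `ProximitySketch` and its methods `tokenize` and `matchesInOrder` are hypothetical names, and the tokenizer (lowercase, split on anything that is not a letter or digit) is only an assumption that roughly imitates a standard analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only: mimics the positional check behind a proximity query. */
public class ProximitySketch {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns true if `first` is followed by `second` with at most `maxGap` tokens in between. */
    public static boolean matchesInOrder(String text, String first, String second, int maxGap) {
        List<String> tokens = tokenize(text);
        String a = first.toLowerCase(Locale.ROOT);
        String b = second.toLowerCase(Locale.ROOT);
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) {
                continue;
            }
            // Look at the next (maxGap + 1) positions after the first term.
            int last = Math.min(tokens.size() - 1, i + 1 + maxGap);
            for (int j = i + 1; j <= last; j++) {
                if (tokens.get(j).equals(b)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // One word between the terms: within the allowed distance.
        System.out.println(matchesInOrder("Facebook acquires WhatsApp for $19 B", "Facebook", "Whatsapp", 1)); // true
        // Five words between the terms: too far apart.
        System.out.println(matchesInOrder("Facebook says it will not buy WhatsApp", "Facebook", "Whatsapp", 1)); // false
    }
}
```

In Lucene itself you would not write this loop. The same check can be expressed programmatically as a PhraseQuery built with PhraseQuery.Builder and setSlop(1), or as an ordered SpanNearQuery over two SpanTermQuery clauses. Note that terms passed directly to those classes are not analyzed, so they would probably need to be lowercased by hand if the field was indexed with a lowercasing analyzer.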

UPD.

Also make sure your "text" field is a TextField so that it is tokenized.

– Jeff French

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow