Question

I'm looking for a simple way to implement proximity search in Java.

By proximity search I mean the feature as Lucene defines it:

Lucene supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde symbol, "~", at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:

"jakarta apache"~10

More specifically: as a start I would like to implement a method of the following form:

public static boolean proximityMatches(String txt, String term1, String term2, int wordDistance) {


// for the inputs:
// txt= "this is a really foo barred world", term1="foo", term2="world", wordDistance=4
// return true

// for the inputs:
// txt= "this is a really foo barred world", term1="this", term2="bar", wordDistance=1
// return false

}

Notes:

  1. I know how to write a function to satisfy the requirements I've put up there -- what I'm looking for is an accepted standard way to implement this.

Thanks.


Solution

If there's an accepted standard way to do this, it's to use Lucene. There are some regex gimmicks you can use, like this one from RegexBuddy's library (where word1 and word2 are placeholders for the search terms, and the 3 in {1,3}? is the maximum number of words allowed between them; as written, the 1 also requires at least one word in between, so adjacent terms do not match):

\b(?:word1(?:\W+\w+){1,3}?\W+word2|word2(?:\W+\w+){1,3}?\W+word1)\b
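To try that pattern from Java, something like the following might do. It is only a sketch: the class name RegexProximity, the method name regexProximityMatches and the parameter maxBetween are made up for illustration, the match is case-insensitive by my own choice, and the quantifier is changed to {0,n} so that adjacent terms also match. Here "distance" is read as the number of words allowed between the two terms, which is one possible interpretation of the question.

```java
import java.util.regex.Pattern;

public class RegexProximity {

    // Hypothetical helper: true if term1 and term2 occur, in either order,
    // with at most maxBetween "words" (runs of \w characters) between them.
    public static boolean regexProximityMatches(String txt, String term1, String term2, int maxBetween) {
        // Pattern.quote stops regex metacharacters in the terms from being interpreted.
        String a = Pattern.quote(term1);
        String b = Pattern.quote(term2);
        // Assumption: {0,n} instead of {1,3}, so zero intervening words is allowed.
        String gap = "(?:\\W+\\w+){0," + maxBetween + "}?\\W+";
        Pattern p = Pattern.compile(
                "\\b(?:" + a + gap + b + "|" + b + gap + a + ")\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(txt).find();
    }

    public static void main(String[] args) {
        String txt = "this is a really foo barred world";
        System.out.println(regexProximityMatches(txt, "foo", "world", 4)); // true
        System.out.println(regexProximityMatches(txt, "this", "bar", 1));  // false
    }
}
```

The second call returns false because "bar" only appears as a prefix of "barred", and the pattern requires a non-word character or a word boundary right after each term.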

Trouble is, this relies on an extremely simplistic, arbitrary notion of what constitutes a word. It doesn't match contractions or hyphenated words, but it does match "words" with digits and underscores in them. You could tweak the regex to deal with those problems, but more will pop up to replace them. And ugly as it already was, each tweak makes the regex that much less readable, that much harder to maintain.

This barely scratches the surface of what full-text search engines save you from. If you have a very specific, tightly constrained task to accomplish, regexes or other "syntax-level" tools might suit. But if you need to work at the semantic level, recognizing natural-language words and phrases, you want a search engine or other dedicated tool.
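For the tightly constrained case, a plain tokenizing version of the method from the question could look roughly like this. It is a sketch under stated assumptions, not an accepted standard: a "word" is taken to be a run of letters, digits or apostrophes, matching is case-insensitive, and wordDistance is treated as the maximum difference between the two word positions, which may not coincide with how Lucene computes slop. The class name TokenProximity and the helper positions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenProximity {

    // Word indexes at which term occurs as a whole token.
    private static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        String wanted = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(wanted)) {
                result.add(i);
            }
        }
        return result;
    }

    public static boolean proximityMatches(String txt, String term1, String term2, int wordDistance) {
        // Assumption: anything that is not a letter, digit or apostrophe separates words.
        String[] tokens = txt.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+");
        for (int i : positions(tokens, term1)) {
            for (int j : positions(tokens, term2)) {
                // Compare positions; i != j avoids matching a word against itself.
                if (i != j && Math.abs(i - j) <= wordDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String txt = "this is a really foo barred world";
        System.out.println(proximityMatches(txt, "foo", "world", 4)); // true
        System.out.println(proximityMatches(txt, "this", "bar", 1));  // false
    }
}
```

Even this small version has to take a position on what a word is, which is exactly the kind of decision an analyzer in a search engine makes for you.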

Other tips

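If you want to know how far to the left of the end of the string a word occurs, you could try this. Note that it counts characters, not words.

String str = "Lucene supports finding words are a within a specific distance away.";
String target = "specific";
boolean found = false;
int start = str.length() - 1;
int end = str.length();

// Grow the window leftward from the end of the string until it contains the target.
// The start >= 0 check stops the loop if the target is not in the string at all.
while ( !found && start >= 0 )
{
    if ( str.substring( start, end ).contains( target ) )
    {
        int total = end - start;
        System.out.println( "Your word has been found " + total + " characters to the left" );
        found = true;
    }
    else
    {
        start -= 1;
    }
}

if ( !found )
{
    System.out.println( "Your word was not found" );
}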
Licensed under: CC-BY-SA with attribution