Question

I've created a simple index using Zend_Search_Lucene for searching a list of company names, as I want to be able to offer a search which is more intelligent than a simple MySQL 'LIKE %query%'. I've used the code below, where 'companyname' is the company name and 'document_id' is a unique ID for each document (I'm aware that Lucene assigns one internally, but I understand that can change, whereas my document ID will be static).

$index = Zend_Search_Lucene::create('test-index');

$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 1));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'XYZ Holdings'));
$index->addDocument($document);

$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 2));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'X.Y.Z. (Holdings) Ltd'));
$index->addDocument($document);

$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 3));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'X Y Z Ltd'));
$index->addDocument($document);

$index->commit();

However, when I run the following code to find all companies with variants of 'XYZ' in their name:

$index = Zend_Search_Lucene::open('test-index');
$hits = $index->find('companyname:XYZ');
foreach ($hits as $hit)
{
  print "ID: " . $hit->document_id . "\n";
  print "Score: " . $hit->score . "\n";
  print "Company: " . $hit->companyname . "\n";
}

I end up with the following:

ID: 1
Score: 1
Company: XYZ Holdings

I was expecting XYZ to match all the documents, as the point of having this search is to pick up companies which have the same name but slightly different punctuation, which can't be catered for in a simple LIKE clause. Is there a reason why Lucene doesn't match all the documents, and is there something I can do to fix this?

I get the same sort of problem if I search for 'companyname:"x.y.z holding"' - this doesn't match anything but 'companyname:"x.y.z holdings"' does. I'd expect Lucene to work out that 'holding' and 'holdings' are sufficiently close to be considered a match.

I'm fairly sure all the documents are indexed because if I search for 'X.Y.Z' I get matches for documents 2 and 3.

Edit: Forgot to mention PHP version (5.3.5-1ubuntu7.4 with Suhosin-Patch) and Zend Framework version (1.11.10-0ubuntu1).


Solution

You can fix the issue by preprocessing your content before indexing it. Lucene will work with tokens and you need to treat them as individual units. I did something similar in the past to match version numbers so that searching for 2.0 would also provide 2.0.3 for example, but not 1.2.0.

The toCanonical() function here is not perfect. I recommend you write your own and build a test suite to make sure it converts the text as you expect. What it does is build a longer string by grouping the things that look like acronyms. You can also call it on the search query.

You will need to search in companyname_canonical instead of companyname.

There may be a cleaner way to do it as a filter within Zend Lucene. You might also want to use a stemmer to handle plural forms and the like. An implementation of the Porter stemmer has already been written: http://codefury.net/2008/06/a-stemming-analyzer-for-zends-php-lucene/
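To illustrate why stemming would help with 'holding' versus 'holdings', here is a very crude sketch. crudeStem() is a name made up for this example and it is not the Porter algorithm; it only strips a trailing "s", assuming lower-cased ASCII tokens. A real stemmer handles far more suffixes, and the stemming has to be applied both when indexing and when querying.

```php
<?php
// Hypothetical helper for illustration only, NOT the Porter stemmer.
// Lower-cases the token and strips one trailing "s" from words longer
// than three letters, leaving words that end in "ss" alone.
function crudeStem($token)
{
    $token = strtolower($token);

    if (strlen($token) > 3
        && substr($token, -1) === 's'
        && substr($token, -2) !== 'ss') {
        return substr($token, 0, -1);
    }

    return $token;
}

// Both forms reduce to the same term, so a search for either one
// would match a document containing the other.
var_dump(crudeStem('Holdings') === crudeStem('holding'));
```

With this in place both 'Holdings' and 'holding' become the term 'holding', while a word such as 'business' is left untouched.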

function toCanonical($text)
{
    $out = $text . ' ';
    $step = $text;

    $pattern = '/([A-Z])[\s\.-]([A-Z])([^a-z])/';
    while (preg_match($pattern, $step)) {
        $step = preg_replace($pattern, '$1$2$3', $step);
        $out .= $step . ' ';
    }

    return $out;
}

function createDocument($id, $companyName)
{
    $canonicalName = toCanonical($companyName);

    $document = new Zend_Search_Lucene_Document();
    // Stored but not indexed: the stable ID returned with each hit.
    $document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', $id));
    // Indexed and stored: the name as entered.
    $document->addField(Zend_Search_Lucene_Field::Text('companyname', $companyName));
    // Indexed but not stored: the expanded form, used only for matching.
    $document->addField(Zend_Search_Lucene_Field::UnStored('companyname_canonical', $canonicalName));

    return $document;
}

// $index is the index created or opened as in the question.
$index->addDocument(createDocument(1, 'XYZ Holdings'));
$index->addDocument(createDocument(2, 'X.Y.Z. (Holding) Company'));
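To check what ends up in companyname_canonical, the sketch below runs toCanonical() over the three names from the question and splits the result into lower-cased runs of letters. lettersOnlyTokens() is a hypothetical helper that only approximates a letters-only, case-insensitive analyzer; which analyzer your index actually uses is an assumption you should verify.

```php
<?php
// Copied from the answer above so that this snippet runs on its own.
function toCanonical($text)
{
    $out = $text . ' ';
    $step = $text;

    $pattern = '/([A-Z])[\s\.-]([A-Z])([^a-z])/';
    while (preg_match($pattern, $step)) {
        $step = preg_replace($pattern, '$1$2$3', $step);
        $out .= $step . ' ';
    }

    return $out;
}

// Hypothetical helper: unique lower-cased runs of letters, a rough
// stand-in for what a letters-only, case-insensitive analyzer indexes.
function lettersOnlyTokens($text)
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);

    return array_values(array_unique($matches[0]));
}

$names = array('XYZ Holdings', 'X.Y.Z. (Holdings) Ltd', 'X Y Z Ltd');
foreach ($names as $name) {
    $tokens = lettersOnlyTokens(toCanonical($name));
    echo $name, ' => ', implode(' ', $tokens), "\n";
}
```

All three names produce an 'xyz' token, so a query on companyname_canonical for XYZ can reach each of them. One limitation worth testing for: the pattern needs a character after the second letter, so a string that ends in the acronym, such as the query 'X Y Z', stops at 'XY Z' and never becomes 'XYZ'.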

OTHER TIPS

When you index "XYZ Holdings" (say you are using StandardAnalyzer), two tokens are produced: "xyz" and "holdings".

For "X.Y.Z. (Holdings) Ltd" the tokens are "x", "y", "z", "holdings" and "ltd".

For "X Y Z Ltd" the tokens are "x", "y", "z" and "ltd".

When you issue companyname:"X.Y.Z" or companyname:"X Y Z", both case 2 and case 3 match, because each query is a phrase of the tokens "x", "y", "z". There's no way Lucene can know that XYZ in case 1 is also an acronym.

I think you should write your own tokenizer that generates the same tokens for "XYZ", "X.Y.Z" and "X Y Z", but this might interfere with other uppercase words that aren't acronyms.
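A minimal sketch of that idea in plain PHP, independent of Zend: split the text into runs of letters, then merge consecutive single uppercase letters into one token. acronymTokens() is a hypothetical name, and the code assumes ASCII company names. To use it with the index you would either wrap the same logic in a custom analyzer or index the joined tokens in a separate field.

```php
<?php
// Hypothetical tokenizer sketch: merges runs of single capital letters
// ("X", "Y", "Z") into one lower-cased token ("xyz").
function acronymTokens($text)
{
    // Runs of letters only, so dots, brackets and spaces all separate tokens.
    preg_match_all('/[A-Za-z]+/', $text, $matches);

    $tokens = array();
    $run = '';
    foreach ($matches[0] as $word) {
        if (preg_match('/^[A-Z]$/', $word)) {
            // A lone capital letter: keep collecting the acronym.
            $run .= $word;
            continue;
        }
        if ($run !== '') {
            $tokens[] = strtolower($run);
            $run = '';
        }
        $tokens[] = strtolower($word);
    }
    if ($run !== '') {
        // The text ended while an acronym was still being collected.
        $tokens[] = strtolower($run);
    }

    return $tokens;
}

foreach (array('XYZ Holdings', 'X.Y.Z. (Holdings) Ltd', 'X Y Z Ltd') as $name) {
    echo $name, ' => ', implode(' ', acronymTokens($name)), "\n";
}
```

All three names now start with the token 'xyz'. The interference mentioned above shows up with input like 'A B Testing', which becomes 'ab' and 'testing' even if A and B were meant as separate words.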

Licensed under: CC-BY-SA with attribution