Question

I have a Word document that contains both "perform" and "performance". When I use the Advanced Find tool in the Word UI (the goal is to eventually translate this to the Find.Execute command for programmatic searching in C#), I get different results when I have the Match All Word Forms option checked.

When I search for "perform", I get both the occurrences for "perform" and "performance".
When I search for "performance", I only get hits for "performance", even though "perform" should still register as a word form for "performance".

Does anyone know how Word's search algorithm works, or how I could make sure that searching for "performance" returns the results for both "perform" and "performance"?

Edit (7/11/12 16:34)-
I ran a couple of test combinations to see if I could find a pattern for myself, and well... it wasn't all that promising (capitalization matters!?).
The document these results were obtained from was a simple Word document containing both uppercase and lowercase versions of each word form. Each search found both the uppercase and lowercase versions of the word.
Here are the results of a few searches and their apparent conclusions.
If anyone can link to documentation clarifying this for me it would be greatly appreciated!

Edit (7/12/12 9:49)-
Even more sadness: I tried switching from the interface inside Word to the Find.Execute command in C#, and the matchSoundsLike parameter does not behave the same way as Advanced Find in the UI :( It seems that the programmatic matchSoundsLike flag only finds sounds-like forms that match case, even though I have matchCase explicitly set to false.


Solution

This seems to be an adaptation of Query Expansion, a rather important area in Information Retrieval.

I would advise against building a query expansion engine yourself, as that's more of a project for a Master's (or possibly a PhD) thesis than a small feature of a larger project. However, if you still wish to implement this feature yourself, I suggest you start with a Google Scholar search for "query expansion" and read up on some of the modern techniques.
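
To make the idea concrete, here is a minimal sketch of query expansion by crude suffix stripping. It is written in Python for brevity and is not how Word implements Match All Word Forms; the suffix list and the names `crude_stem`, `expand_query`, and `find_all_forms` are assumptions made up for this illustration. The point is only that both the query and the document words are reduced to a common stem, so the search becomes symmetric.

```python
import re

# Hypothetical suffix list, for illustration only. A real system would use
# a proper stemmer or a dictionary of word forms.
SUFFIXES = ("ances", "ance", "ers", "er", "ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(term, vocabulary):
    """Return every vocabulary word that shares the query term's crude stem."""
    stem = crude_stem(term)
    return sorted({w for w in vocabulary if crude_stem(w) == stem})


def find_all_forms(term, text):
    """Return the words in text, in order, that are forms of the query term."""
    words = re.findall(r"[A-Za-z]+", text)
    forms = set(expand_query(term, {w.lower() for w in words}))
    return [w for w in words if w.lower() in forms]


if __name__ == "__main__":
    sample = "Perform well. Performance matters."
    print(find_all_forms("performance", sample))
    print(find_all_forms("perform", sample))
```

With an expanded list like this in hand, one workaround would be to run an ordinary search once per expanded form instead of relying on the built-in word-forms option. A suffix list this small will mis-stem plenty of English words, which is part of why a ready-made stemmer is usually the better choice.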

As for pre-existing libraries, most packages focus on web searches and databases, so I'm having a hard time finding anything for searching text files. Google Query Expansion doesn't explicitly say that it's an extension for Google APIs, but that's the impression I get. Microsoft SQL Server seems to have this functionality built in. There's an Apache Lucene module which also implements this. MySQL also has an implementation.

If you wish to use a pre-existing package, it seems you will at the very least need to modify your program's structure so that the text is stored in a database. This would change your problem from a text search problem into a corpus search problem, which is heavily studied and has more documentation and tools available from outside sources. That said, without knowing your data, I don't know whether this is a worthwhile solution or what structure you should choose.
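
As a rough illustration of the database route, the sketch below loads a few paragraphs into an in-memory SQLite full-text index that uses the Porter stemmer. SQLite is swapped in here only because it ships with Python; it is not one of the packages named above. The sketch assumes your SQLite build includes the FTS5 extension, and the table name `docs` and the helpers `build_index` and `search` are hypothetical.

```python
import sqlite3


def build_index(paragraphs):
    """Create an in-memory FTS5 table whose tokenizer applies Porter stemming."""
    conn = sqlite3.connect(":memory:")
    # Assumes the FTS5 extension is compiled into this SQLite build.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
    conn.executemany(
        "INSERT INTO docs(body) VALUES (?)", [(p,) for p in paragraphs]
    )
    return conn


def search(conn, term):
    """Return matching paragraphs in insertion order."""
    rows = conn.execute(
        "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rowid", (term,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index(
        [
            "We perform the test twice.",
            "Performance was acceptable.",
            "An unrelated sentence.",
        ]
    )
    print(search(conn, "performance"))
    print(search(conn, "perform"))
```

Because the stemmer is applied to both the indexed text and the query, searching for either word should return the same two paragraphs. Getting the text out of the Word document and into such an index is a separate step that this sketch does not cover.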

Best of luck. I am sorry I was unable to directly answer your question but I hope I gave you some good sources of information.

Licensed under: CC-BY-SA with attribution