Question

I stumbled upon strange full-text index behavior in SQL Server 2008 R2 (my word-breaker language is German).

I have this text indexed:

[...] Java Editorerstellung in Eclipse eines Modellierungseditors(UML) mit den Eclipse Technologien [...]

I triple-checked: the only occurrences of the string edi are in this short snippet of text, and I can only find it as part of Editorerstellung and Modellierungseditors.

But SQL Server still has edi as a single word in its full-text index (occurrence: 1) and therefore returns it on ContainsTable(...) searches. Why is it recognized as a single word?

Does anybody have an explanation for this behavior? Thanks.


Solution

German compound words require special parsing in natural-language word breakers. For example, "Editorerstellung" is parsed and stored as three separate terms: "editor", "erstellung" and "editorerstellung". Extensive research has been done on analyzing German compound words, and while the techniques are improving, the process is not perfect.
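To see how the word breaker decomposes a compound, one option is to pass the word to sys.dm_fts_parser. The following is a minimal sketch, assuming LCID 1031 (German), no stoplist (stoplist_id 0) and accent-insensitive parsing; the terms actually returned depend on the SQL Server version and the installed word breaker, and running it requires sysadmin membership.

```sql
-- Ask the German word breaker (LCID 1031) how it splits a compound word.
-- Arguments: query string, LCID, stoplist_id (0 = no stoplist),
-- accent sensitivity (0 = insensitive).
SELECT occurrence,
       special_term,
       display_term,
       source_term
FROM sys.dm_fts_parser(N'"Editorerstellung"', 1031, 0, 0)
ORDER BY occurrence;
```

Each row is one term the word breaker emits; display_term is the human-readable form that would end up in the index.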

It is likely that the behavior you are seeing is due to heuristics used in the word breaker. I cannot reproduce your issue using the above snippet and the SQL Server 2012 word breaker, so either Microsoft's improvements to the German word breaker between SQL Server 2008 R2 and SQL Server 2012 solved the problem, or some text you didn't include is the source of "edi" in the full-text index.

You can use sys.dm_fts_index_keywords_by_document() to see which terms are in the index and which documents they come from. By repeatedly halving the suspect text (a binary search), you should be able to narrow it down to the specific text that is generating the "edi" term.
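As a rough sketch of that approach: the table name dbo.Documents below is hypothetical, and the query assumes it is run in the database that holds the full-text index. The first statement lists the documents that contributed "edi"; the second feeds a fragment of the suspect text through the German word breaker (LCID 1031, an assumption) so that you can test one half of the text at a time.

```sql
-- 1) Find which indexed documents produced the term "edi".
--    dbo.Documents is a placeholder for the full-text indexed table.
SELECT display_term,
       column_id,
       document_id,
       occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID(N'dbo.Documents'))
WHERE display_term = N'edi';

-- 2) Binary search: parse one half of the suspect text and check whether
--    "edi" appears. Repeat with smaller fragments until the source is found.
DECLARE @fragment nvarchar(4000) =
    N'"Java Editorerstellung in Eclipse eines Modellierungseditors"';

SELECT occurrence,
       display_term,
       source_term
FROM sys.dm_fts_parser(@fragment, 1031, 0, 0)
WHERE display_term = N'edi';
```

If the second query returns no rows for one half, the term is coming from the other half (or from text outside the snippet), so continue splitting there.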

Licensed under: CC-BY-SA with attribution