Problem

I am trying to develop a complex text search engine. I have thousands of pages of text from many books, and I need to find the pages that satisfy specified complex logical criteria. These criteria can contain virtually any combination of the following:

A: Full words.

B: Word roots (similar to stems; i.e. all words with certain key letters).

C: Word templates (in some languages, roots are fitted into certain templates to form various parts of speech, such as adjectives, past/present verbs...).

D: Logical connectives: AND/OR/XOR/NOT/IF/IFF and parentheses to state priorities.
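As a rough sketch of what the connectives in D mean for search results: if each term yields a set of matching page ids, every connective reduces to a set operation. The names below (`ALL_PAGES`, `op_not`, `op_if`, and so on) are hypothetical, and the page ids are made up. Note that NOT, IF and IFF need the set of all pages, since they can match pages that contain none of the terms.

```python
# Hypothetical universe of page ids; a real engine would take this from the index.
ALL_PAGES = {1, 2, 3, 4}

def op_not(a):
    return ALL_PAGES - a          # pages that do not match a

def op_and(a, b):
    return a & b                  # pages matching both

def op_or(a, b):
    return a | b                  # pages matching either

def op_xor(a, b):
    return a ^ b                  # pages matching exactly one

def op_if(a, b):
    return op_not(a) | b          # "a implies b" is (NOT a) OR b

def op_iff(a, b):
    return op_not(a ^ b)          # pages matching both or neither

a = {1, 2}   # made-up pages matching term a
b = {2, 3}   # made-up pages matching term b
print(op_xor(a, b), op_if(a, b), op_iff(a, b))
```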

Now, would it be faster to keep the pages' full text in a database (not indexed) and search through all of it using SQL and regular expressions?

Or would it be better to construct indexes of (word/root/template, page, location) tuples? That would speed up searching for individual words/roots/templates. However, it gets tricky once we introduce logical connectives into our queries. I thought of doing the following steps in such cases:

1: Separately search for each individual word/root/template in the specified query.

2: In order of priority, merge two result lists (from step 1) at a time, depending on the logical connective.
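Step 1 depends on the word/page/location index mentioned above. A minimal sketch of such an index, assuming plain whitespace-and-punctuation tokenization and whole words only (roots and templates would need their own language-specific analysis, which is not shown). `build_index` and the sample `pages` are hypothetical.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to {page_id: [word positions]}; `pages` is {page_id: text}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        # Lowercase and split on non-word characters (a simplifying assumption).
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][page_id].append(position)
    return index

# Made-up sample pages
pages = {1: "He is here", 2: "She was there", 3: "He was late"}
index = build_index(pages)

print(sorted(index["he"]))   # ids of pages containing "he"
print(index["was"][3])       # positions of "was" on page 3
```

Looking up a single term is then a dictionary access, and the sorted page ids for each term are the result lists that step 2 merges.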

For example, if we are searching for "he AND (is OR was)":

1: We search for "he", "is" and "was" separately and get a result list for each word.

2: Merge the result lists of "is" and "was" using the merging function OR-MERGE.

3: Merge the result list from the OR-MERGE function with the result list of "he" using the merging function AND-MERGE.

The result of step 3 is then returned as the result of the specified query.
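The three steps above can be sketched as follows, assuming each result list is a sorted list of page ids without duplicates. `and_merge` and `or_merge` are hypothetical names standing in for AND-MERGE and OR-MERGE, and the posting lists are made up. Both functions walk the two lists once, so the cost is linear in their combined length.

```python
def and_merge(a, b):
    """Intersect two sorted lists of page ids."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def or_merge(a, b):
    """Union two sorted lists of page ids, keeping the result sorted."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # whatever remains in either list
    out.extend(b[j:])
    return out

# Step 1: made-up result lists for each word
postings = {"he": [1, 3, 7], "is": [1, 2], "was": [3, 5]}

# Step 2: "is OR was"
is_or_was = or_merge(postings["is"], postings["was"])

# Step 3: "he AND (is OR was)"
result = and_merge(postings["he"], is_or_was)
print(result)
```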

What do you think, gurus? Which is faster? Any better ideas?

Thank you all in advance.

Solution

There are plenty of off-the-shelf solutions to this kind of problem. I would strongly recommend you use one of those instead of developing your own.

You don't say which database you're using. If it's Microsoft SQL Server, you could use its Full Text Search features. If it's MySQL, take a look at its Full-Text Search Functions. I'm sure Oracle, DB2 and any other major DBMS will have similar functionality.

Alternatively, take a look at Apache's Lucene for Java or Lucene for .NET. This will allow you to index documents without needing to use a DBMS.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow