I wanted to do some information retrieval tasks with Prolog. Currently I have a (large) set of distinct Prolog theories representing the dependencies in a sentence (incidentally, I store these Prolog codes in a text file) - and I want to find only those theories which are matching for a user defined goal clause.

For example I have Prolog codes like this:

rel("nsubjpass","seen","It",S):-S is 1.
rel("aux","seen","can",S):-S is 1.
rel("auxpass","seen","be",S):-S is 1.
...
rel("prep", X, Y, S):-rel("dobj", Z, X, SCORE1), rel("prep", Z, Y, SCORE2), S is SCORE1*SCORE2.
...
rel(_,_,_, S):-S is 0.0001.

And I want to search for goal clauses like this:

?- rel("nsubj", WRONG, "Russell", SCORE1), 
   rel("nsubj", WRONG, "Russell", SCORE2), 
   ... , 
   rel("dobj",DOING,SOMETHING, SCORE9).

If I do a simple foreach-like loop over the theories, the search time will be getting slow as the number of the theories increases, so I have to introduce some optimalizations.

My idea is that to create an inverted index, where I could maintain the frequency of every term, and the ID of the theories where they are occured. Then when it comes to search, I would first filter out the unnecessary theories which otherwise wouldn't contained in the results.

Are there any other well-tried methods or good patterns and algorithms on the field of information retrieval that can handle this problem well?

有帮助吗?

解决方案 2

I think Boris's answer is reasonable from a Prolog developer point of view, but I didn't want to spend too much time to investigate how could I call Java functions from Prolog and vice versa, not to mention the problem of creating a standalone executable from the Prolog scripts which is a requirement for me.

So I was a bit stubborn and I tried to implement my original idea of creating an inverted index from the Prolog terms and it works flawlessly. So now when I want to search for a certain Prolog goal, in the first step I can filter the "database" for the relevant theories: I compute the intersection of the occurences of the terms, and I run the Prolog engine with only these theories. Another big performance optimization was to share a singleton TuProlog engine between the individual searches, and it also decreased the memory overhead. Oh and I also refactored the rules too, for example now I write this:

rel(nsubjpass, "seen","It", 1).

instead of this:

rel("nsubjpass","seen","It",S):-S is 1.

I didn't notice any big performance boost from this change, but at least it didn't get slower :)

With these changes, now it can be seen that the real bottleneck of the performance is not running the Prolog engine, but using NLP libraries...but it's another issue. :)

其他提示

I am writing this as an answer instead of a comment just because comments are very limiting.

First of all, if you indeed plan to use Prolog, it would pay off to do some research on

  1. The language itself:
    • what are its strengths and weaknesses?
    • can I achieve my goals in idiomatic Prolog (have similar problems been solved already)
  2. Different implementations:
    • what does a particular implementation offer over others
    • how easy it is to interface an implementation with other languages

After a short struggle in the beginning, I have been using SWI-Prolog exclusively, so I am biased. But SWI offers a very complete, general, reasonably efficient implementation, excellent documentation, and is completely open-source.

Moving on, you seem to be using Prolog facts to represent your data. Prolog facts in the form:

foo(a,b,1).
foo(a,d,10).
...

build the Prolog database. Most implementation offer some sort of indexing of facts, so that a query like "give me 'foo's with 'a' as the first argument":

?- foo(a,B,X).
B = b, X = 3;
B = d, X = 10.

does a very efficient search through all available facts and returns the ones matching your query. You definitely don't need to search through facts with a for loop, or code a fact search algorithm. Note, however, that fact indexing can vary between implementations.

Furthermore, depending on your use-cases, representing strings as Prolog atoms could be the best solution, efficient and easy-to-implement. Look here, especially towards the end, for some very useful "string manipulation" predicates on atoms.

Moving on, writing top-down, left-to-right parser in Prolog is a feature of the language, using DCGs (definite clause grammars).

At the end, if you are comfortable with Java and need only to add an efficient (dynamic) data store, Java + SQL might be the better choice.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top