Performance optimizations of information retrieval in Prolog

Question 1

I think Boris's answer is reasonable from a Prolog developer point of view, but I didn't want to spend too much time to investigate how could I call Java functions from Prolog and vice versa, not to mention the problem of creating a standalone executable from the Prolog scripts which is a requirement for me.

So I was a bit stubborn and I tried to implement my original idea of creating an inverted index from the Prolog terms and it works flawlessly. So now when I want to search for a certain Prolog goal, in the first step I can filter the "database" for the relevant theories: I compute the intersection of the occurences of the terms, and I run the Prolog engine with only these theories. Another big performance optimization was to share a singleton TuProlog engine between the individual searches, and it also decreased the memory overhead. Oh and I also refactored the rules too, for example now I write this:

rel(nsubjpass, "seen","It", 1).

instead of this:

rel("nsubjpass","seen","It",S):-S is 1.

I didn't notice any big performance boost from this change, but at least it didn't get slower :)

With these changes, now it can be seen that the real bottleneck of the performance is not running the Prolog engine, but using NLP libraries...but it's another issue. :)

Question 2

I am writing this as an answer instead of a comment just because comments are very limiting.

First of all, if you indeed plan to use Prolog, it would pay off to do some research on

The language itself:
- what are its strengths and weaknesses?
- can I achieve my goals in idiomatic Prolog (have similar problems been solved already)
Different implementations:
- what does a particular implementation offer over others
- how easy it is to interface an implementation with other languages

After a short struggle in the beginning, I have been using SWI-Prolog exclusively, so I am biased. But SWI offers a very complete, general, reasonably efficient implementation, excellent documentation, and is completely open-source.

Moving on, you seem to be using Prolog facts to represent your data. Prolog facts in the form:

foo(a,b,1).
foo(a,d,10).
...

build the Prolog database. Most implementation offer some sort of indexing of facts, so that a query like "give me 'foo's with 'a' as the first argument":

?- foo(a,B,X).
B = b, X = 3;
B = d, X = 10.

does a very efficient search through all available facts and returns the ones matching your query. You definitely don't need to search through facts with a for loop, or code a fact search algorithm. Note, however, that fact indexing can vary between implementations.

Furthermore, depending on your use-cases, representing strings as Prolog atoms could be the best solution, efficient and easy-to-implement. Look here, especially towards the end, for some very useful "string manipulation" predicates on atoms.

Moving on, writing top-down, left-to-right parser in Prolog is a feature of the language, using DCGs (definite clause grammars).

At the end, if you are comfortable with Java and need only to add an efficient (dynamic) data store, Java + SQL might be the better choice.