Parsing in Prolog without cut?

https://stackoverflow.com/questions/6358556

28-10-2019
|

Question

I found this nice snippet for parsing lisp in Prolog (from here):

ws --> [W], { code_type(W, space) }, ws.
ws --> [].

parse(String, Expr) :- phrase(expressions(Expr), String).

expressions([E|Es]) -->
    ws, expression(E), ws,
    !, % single solution: longest input match
    expressions(Es).
expressions([]) --> [].

% A number N is represented as n(N), a symbol S as s(S).

expression(s(A))         --> symbol(Cs), { atom_codes(A, Cs) }.
expression(n(N))         --> number(Cs), { number_codes(N, Cs) }.
expression(List)         --> "(", expressions(List), ")".
expression([s(quote),Q]) --> "'", expression(Q).

number([D|Ds]) --> digit(D), number(Ds).
number([D])    --> digit(D).

digit(D) --> [D], { code_type(D, digit) }.

symbol([A|As]) -->
    [A],
    { memberchk(A, "+/-*><=") ; code_type(A, alpha) },
    symbolr(As).

symbolr([A|As]) -->
    [A],
    { memberchk(A, "+/-*><=") ; code_type(A, alnum) },
    symbolr(As).
symbolr([]) --> [].

However expressions uses a cut. I'm assuming this is for efficiency. Is it possible to write this code so that it works efficiently without cut?

Would also be in interested answers that involve Mercury's soft-cut / committed choice.

Solution

The cut is not used for efficiency, but to commit to the first solution (see the comment next to the !/0: "single solution: longest input match"). If you comment out the !/0, you get for example:

?- parse("abc", E).
E = [s(abc)] ;
E = [s(ab), s(c)] ;
E = [s(a), s(bc)] ;
E = [s(a), s(b), s(c)] ;
false.

It is clear that only the first solution, consisting of the longest sequence of characters that form a token, is desired in such cases. Given the example above, I therefore disagree with "false": expression//1 is ambiguous, because number//1 and symbolr//1 are. In Mercury, you could use the determinism declaration cc_nondet to commit to a solution, if any.

OTHER TIPS

You are touching a quite deep problem here. At the place of the cut you have added the comment "longest input match". But what you actually did was to commit to the first solution which will produce the "longest input match" for the non-terminal ws//0 but not necessarily for expression//1.

Many programming languages define their tokens based on the longest input match. This often leads to very strange effects. For example, a number may be immediately followed by a letter in many programming languages. That's the case for Pascal, Haskell, Prolog and many other languages. E.g. if a>2then 1 else 2 is valid Haskell. Valid Prolog: X is 2mod 3.

Given that, it might be a good idea to define a programming language such that it does not depend on such features at all.

Of course, you would then like to optimize the grammar. But I can only recommend to start with a definition that is unambiguous in the first place.

As for efficiency (and purity):

eos([],[]).

nows --> call(eos).
nows, [W] --> [W], { code_type(W, nospace) }.

ws --> nows.
ws --> [W], {code_type(W, space)}, ws.

You could use a construct that has already found its place in Parsing Expression Grammars (PEGs) but which is also available in DCGs. Namely the negation of a DCG goal. In PEGs the exclamation mark (!) with an argument is used for negation, i.e. ! e. In DCG the negation of a DCG goal is expressed by the (\+) operator, which is already used for ordinary negation as failure in ordinary Prolog clauses and queries.

So lets first explain how (\+) works in DCGs. If you have a production rule of the form:

 A --> B, \+C, D.

Then this is translated to:

 A(I,O) :- B(I,X), \+ C(X,_), D(X,O).

Which means an attempt is made to parse the C DCG goal, but without actually consuming the input list. Now this can be used to replace the cut, if desired, and it gives a little bit more declarative feeling. To explain the idea lets assume that with have a grammar without ws//0. So the original clause set of expressions//1 would be:

expressions([E|Es]) --> expression(E), !, expressions(Es).
expressions([]) --> [].

With negation we can turn this into the following cut-less form:

expressions([E|Es]) --> expression(E), expressions(Es).
expressions([]) --> \+ expression(_).

Unfortunately the above variant is quite un-efficient, since an attempt to parse an expression is made twice. Once in the first rule, and then again in the second rule for the negation. But you could do the following and only check for the negation of the beginning of an expression:

expressions([E|Es]) --> expression(E), expressions(Es).
expressions([]) --> \+ symbol(_), \+ number(_), \+ "(", \+ "'".

If you try negation, you will see that you get a relatively strict parser. This is important if you try to parse maximum prefix of input and if you want to detect some errors. Try that:

?- phrase(expressions(X),"'",Y).

You should get failure in the negation version which checks the first symbol of the expression. In the cut and in the cut free version you will get success with the empty list as a result.

But you could also deal in another way with errors, I have only made the error example to highlight a little bit how the negation version works.

In other settings, for example CYK parser, one can make the negation quite efficient, it can use the information which is already placed in the chart.

Best Regards

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow