Parsing inflected non-word-order languages (e.g. Latin)

https://stackoverflow.com/questions/17939291

prolog
dcg

04-06-2022

Question

Taking an example from the Introduction to Latin Wikiversity, consider the sentence:

the sailor gives the girl money

We can handle this in Prolog with a DCG fairly elegantly with this pile of rules:

sentence(s(NP, VP)) --> noun_phrase(NP), verb_phrase(VP).
noun_phrase(Noun) --> det, noun(Noun).
noun_phrase(Noun) --> noun(Noun).
verb_phrase(vp(Verb, DO, IO)) --> verb(Verb), noun_phrase(IO), noun_phrase(DO).

det --> [the].
noun(X) --> [X], { member(X, [sailor, girl, money]) }.
verb(gives) --> [gives].

And we see that this works:

?- phrase(sentence(S), [the,sailor,gives,the,girl,money]).
S = s(sailor, vp(gives, money, girl)) ;

It seems to me that the DCG is really optimized for handling word-order languages. I'm at a complete loss as to how to handle this Latin sentence:

nauta dat pecuniam puellae

This means the same thing (the sailor gives the girl money), but the word order is completely free: all of these permutations also mean exactly the same thing:

nauta dat puellae pecuniam
nauta puellae pecuniam dat
puellae pecuniam dat nauta
puellae pecuniam nauta dat
dat pecuniam nauta puellae

The first thing that occurs to me is to enumerate the permutations:

sentence(s(NP, VP)) --> noun_phrase(NP), verb_phrase(VP).
sentence(s(NP, VP)) --> verb_phrase(VP), noun_phrase(NP).

but this won't do: while nauta belongs to the subject noun phrase, puellae, which belongs to the object noun phrase and is therefore subordinate to the verb, can still precede the verb. I wonder if I should approach it by first building some kind of attributed list, like so:

?- attributed([nauta,dat,pecuniam,puellae], Attributed).
Attributed = [noun(nauta,nom), verb(do,3,s), noun(pecunia,acc), noun(puella,dat)]

This seems like it will turn out to be necessary (and I don't see a good way to do it), but grammatically it's pushing food around on my plate. Maybe I could write a parser with some kind of horrifying non-DCG contraption like this:

parse(s(NounPhrase, VerbPhrase), Attributed) :-
  parse(subject_noun_phrase(NounPhrase), Attributed),
  parse(verb_phrase(VerbPhrase), Attributed).

parse(subject_noun_phrase(Noun), Attributed) :-
  member(noun(Noun,nom), Attributed).

parse(object_noun_phrase(Noun), Attributed) :-
  member(noun(Noun,acc), Attributed).

This seems like it would work, but only as long as I have no recursion; as soon as I introduce a subordinate clause I'm going to reuse subjects in an unhealthy way.

I just don't see how to get from a non-word-order sentence to a parse tree. Is there a book that discusses this? Thanks.

Solution

I found a related resource: PERMUTATIONAL GRAMMAR FOR FREE WORD ORDER LANGUAGES. It seems worth reading (hey, we all hated those mandatory Latin lessons so much, back in the 60s!).

Its appendix contains an implementation you can test.

I forgot to point out Covington's free-word-order parser (it's just a sketch...). You can find it in the PRoNTo toolkit (I mention it here for the sake of completeness, but I'm fairly sure you already know about it).

OTHER TIPS

It seems (drawing from my extremely rusty memory of high school Latin) that your lexical analyzer needs to look at each token (word) and annotate it with the appropriate metadata:

type of word (noun, verb, adjective, etc.)
For nouns, declension, gender, case and number
For verbs, conjugation, person, number, tense, voice and mood
For adjectives, gender, declension, number...
etc. (It's been a long time LOL).

Then your parse should be guided by the metadata, since that's what ties everything together.
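As a rough illustration of a metadata-guided parse, here is a minimal sketch. The predicates lex/2, attributed/2 and parse/2 are hypothetical names made up for this example, and the lexicon covers only the four words of the sample sentence with one analysis each (a real lexicon would list several analyses for an ambiguous form such as puellae). Roles are picked by case rather than by position, and select/3 removes each word once it has been used, so no word can fill two roles:

```prolog
:- use_module(library(lists)).   % select/3

% Hypothetical mini-lexicon: lex(SurfaceForm, Analysis).
% A real system would need a proper morphological analyser here.
lex(nauta,    noun(nauta, nom)).
lex(puellae,  noun(puella, dat)).
lex(pecuniam, noun(pecunia, acc)).
lex(dat,      verb(do, 3, s)).

% attributed(+Words, -Attributed): annotate every word with its analysis.
attributed([], []).
attributed([W|Ws], [A|As]) :-
    lex(W, A),
    attributed(Ws, As).

% parse(+Words, -Tree): choose constituents by case, not by position.
% Each select/3 call consumes one analysis; the last one insists that
% nothing is left over.
parse(Words, s(Subj, vp(Verb, DO, IO))) :-
    attributed(Words, A0),
    select(verb(Verb, _, _), A0, A1),
    select(noun(Subj, nom),  A1, A2),
    select(noun(DO, acc),    A2, A3),
    select(noun(IO, dat),    A3, []).
```

With these definitions, any ordering of the four words should give the same tree:

?- parse([puellae,pecuniam,dat,nauta], T).
T = s(nauta, vp(do, pecunia, puella)).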

You could use this meta clause:

unsorted([]) --> [].
unsorted([H|T]) -->
    H, unsorted(T).
unsorted([H|T]) -->
    unsorted(T), H.

sentence(s(NP, VP)) --> unsorted([noun_phrase(NP), verb_phrase(VP)]).
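Note that unsorted//1 as written puts its head either before or after all the remaining items. That covers both orders of two constituents, but for three or more it never places the head in the middle. A possible variant that tries every order is sketched below; any_order//1 is a hypothetical name, and the flat four-constituent sentence rule and tiny lexicon are assumptions made only to cover the example sentence:

```prolog
:- use_module(library(lists)).   % select/3

% any_order(+NonTerminals)//: parse the listed non-terminals in any order.
any_order([]) --> [].
any_order(Goals) -->
    { select(G, Goals, Rest) },  % pick any remaining non-terminal
    call(G),                     % parse it at the current position
    any_order(Rest).

% Hypothetical lexicon as DCG rules: noun(Lemma, Case), verb(Lemma).
noun(nauta, nom)   --> [nauta].
noun(puella, dat)  --> [puellae].
noun(pecunia, acc) --> [pecuniam].
verb(do)           --> [dat].

% The tree is assembled from the cases, independent of the surface order.
sentence(s(Subj, vp(V, DO, IO))) -->
    any_order([noun(Subj, nom), verb(V), noun(DO, acc), noun(IO, dat)]).
```

For example:

?- phrase(sentence(S), [puellae,pecuniam,nauta,dat]).
S = s(nauta, vp(do, pecunia, puella)).

The rule is flat on purpose: it treats the verb and its arguments as siblings so that they can interleave freely, and it rebuilds the s/vp nesting only in the result term.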

Licensed under: CC-BY-SA with attribution