PEG per Python stile di indentazione

https://stackoverflow.com/questions/4205442

25-09-2019
|

Domanda

Come si scrive un di analisi Espressione Grammar in uno dei seguenti generatori Parser ( PEG.js , Citrus , Treetop ) che può gestire Python / Haskell / CoffeScript stile di indentazione:

Gli esempi di un linguaggio di programmazione non ancora esistente:

square x =
    x * x

cube x =
    x * square x

fib n =
  if n <= 1
    0
  else
    fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets

Aggiornamento: Non cercare di scrivere un interprete per gli esempi di cui sopra. Mi interessa solo il problema indentazione. Un altro esempio potrebbe essere l'analisi del seguente:

foo
  bar = 1
  baz = 2
tap
  zap = 3

# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }

Soluzione

Pure PEG non può rientro di analisi.

Ma peg.js può.

Ho fatto un esperimento di quick-and-dirty (ispirandosi commento di Ira Baxter circa barare) e ha scritto un semplice tokenizzatore.

Per una soluzione più completa (un parser completo) si veda questa domanda: Parse rientro livello con PEG.js

/* Initializations */
{
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0])
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  function indent(s) {
    var depth = s.length;

    if (depth == depths[0]) return [];

    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }

    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }

    if (depth != depths[0]) dents.push("BADDENT");

    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text                      { return [depth, s] }
indent  = s:" "*                                   { return indent(s) }
text    = c:[^\n]*                                 { return c.join("") }
newline = "\n"                                     {}

depths è una pila di rientranze. trattino () restituisce una serie di gettoni di rientro e start () scarta la matrice per rendere il parser si comportano un po 'come un torrente.

peg.js produce il testo:

alpha
  beta
  gamma
    delta
epsilon
    zeta
  eta
theta
  iota

questi risultati:

[
   "alpha",
   "INDENT",
   "beta",
   "gamma",
   "INDENT",
   "delta",
   "DEDENT",
   "DEDENT",
   "epsilon",
   "INDENT",
   "zeta",
   "DEDENT",
   "BADDENT",
   "eta",
   "theta",
   "INDENT",
   "iota",
   "DEDENT",
   "",
   ""
]

Questa tokenizer cattura anche cattivi trattino.

Altri suggerimenti

Credo che un linguaggio sensibile al rientro del genere è sensibile al contesto. Credo PEG può fare solo linguaggi liberi dal contesto.

Si noti che, mentre la risposta di nalply è certamente corretto che PEG.js può farlo tramite stato esterno (cioè le variabili globali temute), può essere un percorso pericoloso camminare verso il basso (peggio che i soliti problemi con le variabili globali). Alcune regole possono inizialmente abbinare (e quindi eseguire le loro azioni), ma le regole genitore può fallire provocando così la corsa di azione per essere valido. Se lo stato esterno viene modificato in tale azione una, si può finire con stato non valido. Questo è super terribile, e potrebbe portare a tremori, vomito, e la morte. Alcuni problemi e soluzioni a questo sono nei commenti qui: https://github.com/dmajda/pegjs / temi / 45

Quindi quello che stiamo veramente facendo qui con rientro è la creazione di qualcosa di simile a un blocchi di C-stile che spesso hanno un proprio ambito lessicale. Se dovessi scrivere un compilatore per un linguaggio del genere penso che vorrei cercare di avere la pista lexer mastio del rientro. Ogni volta che il rientro aumenta potrebbe inserire un '{' token. Similmente ogni volta diminuisce potrebbe inserto un '}' token. Poi scrivere una grammatica un'espressione con parentesi graffe esplicite per rappresentare lessicale portata diventa più semplice.

E 'possibile fare questo in Treetop utilizzando predicati semantici. In questo caso è necessario un predicato semantico che rileva chiusura di un bianco-spazio rientrati blocco a causa del verificarsi di un'altra linea che ha lo stesso o minore indentazione. Il predicato deve contare il rientro dalla linea di apertura, e restituire true (blocco chiusa) se indentazione della linea corrente è terminato alla stessa o minore lunghezza. Perché la condizione di chiusura è dipendente dal contesto, non deve essere memoized. Ecco il codice di esempio che sto per aggiungere alla documentazione del Cima. Si noti che ho sovrascritto SyntaxNode di Treetop metodo ispezionare per rendere più facile per visualizzare il risultato.

grammar IndentedBlocks
  rule top
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    nested_blocks
    {
      def inspect
        nested_blocks.inspect
      end
    }
  end

  rule nested_blocks
    (
      # Do not try to extract this semantic predicate into a new rule.
      # It will be memo-ized incorrectly because @indents.last will change.
      !{|s|
        # Peek at the following indentation:
        save = index; i = _nt_indentation; index = save
        # We're closing if the indentation is less or the same as our enclosing block's:
        closing = i.text_value.length <= @indents.last
      }
      block
    )*
    {
      def inspect
        elements.map{|e| e.block.inspect}*"\n"
      end
    }
  end

 rule block
    indented_line       # The block's opening line
    &{|s|               # Push the indent level to the stack
      level = s[0].indentation.text_value.length
      @indents << level
      true
    }
    nested_blocks       # Parse any nested blocks
    &{|s|               # Pop the indent stack
      # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
      @indents.pop
      true
    }
    {
      def inspect
        indented_line.inspect +
          (nested_blocks.elements.size > 0 ? (
            "\n{\n" +
            nested_blocks.elements.map { |content|
              content.block.inspect+"\n"
            }*'' +
            "}"
          )
          : "")
      end
    }
  end

  rule indented_line
    indentation text:((!"\n" .)*) "\n"
    {
      def inspect
        text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end
end

Ecco un piccolo programma di test driver in modo da poter provare facilmente:

require 'polyglot'
require 'treetop'
require 'indented_blocks'

parser = IndentedBlocksParser.new

input = <<END
def foo
  here is some indented text
    here it's further indented
    and here the same
      but here it's further again
      and some more like that
    before going back to here
      down again
  back twice
and start from the beginning again
  with only a small block this time
END 

parse_tree = parser.parse input

p parse_tree

So che questo è un vecchio thread, ma volevo solo aggiungere del codice PEGjs alle risposte. Questo codice analizzare un pezzo di testo e "nido" in una sorta di struttura "AST-ish". Si va solo un profondo e sembra brutto, inoltre, in realtà non utilizza i valori di ritorno per creare la giusta struttura, ma mantiene un albero in memoria della vostra sintassi e ritornerà che alla fine. Questo potrebbe diventare ingombrante e causare alcuni problemi di prestazioni, ma almeno fa quello che si suppone.

Nota: Assicurarsi di avere le schede al posto degli spazi

{ 
    var indentStack = [], 
        rootScope = { 
            value: "PROGRAM",
            values: [], 
            scopes: [] 
        };

    function addToRootScope(text) {
        // Here we wiggle with the form and append the new
        // scope to the rootScope.

        if (!text) return;

        if (indentStack.length === 0) {
            rootScope.scopes.unshift({
                text: text,
                statements: []
            });
        }
        else {
            rootScope.scopes[0].statements.push(text);
        }
    }
}

/* Add some grammar */

start
  = lines: (line EOL+)*
    { 
        return rootScope;
    }


line
  = line: (samedent t:text { addToRootScope(t); }) &EOL
  / line: (indent t:text { addToRootScope(t); }) &EOL
  / line: (dedent t:text { addToRootScope(t); }) &EOL
  / line: [ \t]* &EOL
  / EOF

samedent
  = i:[\t]* &{ return i.length === indentStack.length; }
    {
        console.log("s:", i.length, " level:", indentStack.length);
    }

indent
  = i:[\t]+ &{ return i.length > indentStack.length; }
    {
        indentStack.push(""); 
        console.log("i:", i.length, " level:", indentStack.length);
    }

dedent
    = i:[\t]* &{ return i.length < indentStack.length; }
      {
          for (var j = 0; j < i.length + 1; j++) {
              indentStack.pop();
          } 
          console.log("d:", i.length + 1, " level:", indentStack.length);  
      }

text
    = numbers: number+  { return numbers.join(""); } 
    / txt: character+   { return txt.join(""); }


number
    = $[0-9] 

character 
    = $[ a-zA-Z->+]  
__
    = [ ]+

_ 
    = [ ]*

EOF 
    = !.

EOL
    = "\r\n" 
    / "\n" 
    / "\r"

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow