PEG para recuo de estilo Python

https://stackoverflow.com/questions/4205442

25-09-2019
|

Pergunta

Como você escreveria um Analisando Gramática de Expressão em qualquer um dos seguintes Geradores de Analisador (PEG.js, Citrino, Copa da árvore) que pode lidar com o recuo do estilo Python/Haskell/CoffeScript:

Exemplos de uma linguagem de programação ainda não existente:

square x =
    x * x

cube x =
    x * square x

fib n =
  if n <= 1
    0
  else
    fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets

Atualizar:Não tente escrever um intérprete para os exemplos acima.Estou interessado apenas no problema de indentação.Outro exemplo pode ser analisar o seguinte:

foo
  bar = 1
  baz = 2
tap
  zap = 3

# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }

Solução

O pino puro não pode analisar o recuo.

Mas Peg.js posso.

Fiz um experimento rápido e sujo (sendo inspirado no comentário de Ira Baxter sobre trapacear) e escrevi um tokenizador simples.

Para uma solução mais completa (um analisador completo), consulte esta pergunta: Palário de indentação com PEG.JS

/* Initializations */
{
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0])
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  function indent(s) {
    var depth = s.length;

    if (depth == depths[0]) return [];

    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }

    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }

    if (depth != depths[0]) dents.push("BADDENT");

    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text                      { return [depth, s] }
indent  = s:" "*                                   { return indent(s) }
text    = c:[^\n]*                                 { return c.join("") }
newline = "\n"                                     {}

depths é uma pilha de recortes. O indent () retribui uma variedade de tokens de indentação e inicia () desembrulhem a matriz para fazer o analisador se comportar um pouco como um fluxo.

Peg.js produz para o texto:

alpha
  beta
  gamma
    delta
epsilon
    zeta
  eta
theta
  iota

Esses resultados:

[
   "alpha",
   "INDENT",
   "beta",
   "gamma",
   "INDENT",
   "delta",
   "DEDENT",
   "DEDENT",
   "epsilon",
   "INDENT",
   "zeta",
   "DEDENT",
   "BADDENT",
   "eta",
   "theta",
   "INDENT",
   "iota",
   "DEDENT",
   "",
   ""
]

Este tokenizador até pega recuos ruins.

Outras dicas

Acho que uma linguagem sensível à indentação como essa é sensível ao contexto.Acredito que o PEG só pode fazer linguagens livres de contexto.

Observe que, embora a resposta de nalply seja certamente correta de que PEG.js pode fazer isso por meio de estado externo (ou seja, as temidas variáveis globais), pode ser um caminho perigoso a percorrer (pior do que os problemas habituais com variáveis globais).Algumas regras podem inicialmente corresponder (e depois executar suas ações), mas as regras pai podem falhar, fazendo com que a execução da ação seja inválida.Se o estado externo for alterado em tal ação, você poderá acabar com um estado inválido.Isso é horrível e pode causar tremores, vômitos e morte.Alguns problemas e soluções para isso estão nos comentários aqui: https://github.com/dmajda/pegjs/issues/45

Então, o que realmente estamos fazendo aqui com o recuo é criar algo como um bloco de estilo C que geralmente tem seu próprio escopo lexical. Se eu estivesse escrevendo um compilador para um idioma como esse, acho que tentaria fazer o Lexer acompanhar o indentação. Toda vez que o recuo aumenta, pode inserir um token '{'. Da mesma forma, toda vez que diminui, pode inserir um token '}'. Em seguida, escrever uma gramática de expressão com aparelhos encaracolados explícitos para representar o escopo lexical se torna mais direto.

Você pode fazer isso na árvore usando predicados semânticos. Nesse caso, você precisa de um predicado semântico que detecte o fechamento de um bloqueio recuado de espaço branco devido à ocorrência de outra linha que possui o mesmo ou menor indentação. O predicado deve contar o recuo da linha de abertura e retornar TRUE (bloco fechado) se o recuo da linha atual terminou no mesmo ou menor comprimento. Como a condição de fechamento depende do contexto, ela não deve ser memorada. Aqui está o código de exemplo que estou prestes a adicionar à documentação da Treetop. Observe que eu substituí o método SyntaxNode da Treetop para facilitar a visualização do resultado.

grammar IndentedBlocks
  rule top
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    nested_blocks
    {
      def inspect
        nested_blocks.inspect
      end
    }
  end

  rule nested_blocks
    (
      # Do not try to extract this semantic predicate into a new rule.
      # It will be memo-ized incorrectly because @indents.last will change.
      !{|s|
        # Peek at the following indentation:
        save = index; i = _nt_indentation; index = save
        # We're closing if the indentation is less or the same as our enclosing block's:
        closing = i.text_value.length <= @indents.last
      }
      block
    )*
    {
      def inspect
        elements.map{|e| e.block.inspect}*"\n"
      end
    }
  end

 rule block
    indented_line       # The block's opening line
    &{|s|               # Push the indent level to the stack
      level = s[0].indentation.text_value.length
      @indents << level
      true
    }
    nested_blocks       # Parse any nested blocks
    &{|s|               # Pop the indent stack
      # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
      @indents.pop
      true
    }
    {
      def inspect
        indented_line.inspect +
          (nested_blocks.elements.size > 0 ? (
            "\n{\n" +
            nested_blocks.elements.map { |content|
              content.block.inspect+"\n"
            }*'' +
            "}"
          )
          : "")
      end
    }
  end

  rule indented_line
    indentation text:((!"\n" .)*) "\n"
    {
      def inspect
        text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end
end

Aqui está um pequeno programa de driver de teste para que você possa experimentá -lo facilmente:

require 'polyglot'
require 'treetop'
require 'indented_blocks'

parser = IndentedBlocksParser.new

input = <<END
def foo
  here is some indented text
    here it's further indented
    and here the same
      but here it's further again
      and some more like that
    before going back to here
      down again
  back twice
and start from the beginning again
  with only a small block this time
END 

parse_tree = parser.parse input

p parse_tree

Sei que este é um tópico antigo, mas eu só queria adicionar alguns código PegJs às respostas. Este código analisará um pedaço de texto e o "ninho" em uma espécie de estrutura "ast-ish". Ele só se aprofunda e parece feio, além disso, não usa realmente os valores de retorno para criar a estrutura certa, mas mantém uma árvore na memória da sua sintaxe e retornará isso no final. Isso pode se tornar pesado e causar alguns problemas de desempenho, mas pelo menos faz o que deveria.

Nota: verifique se você tem guias em vez de espaços!

{ 
    var indentStack = [], 
        rootScope = { 
            value: "PROGRAM",
            values: [], 
            scopes: [] 
        };

    function addToRootScope(text) {
        // Here we wiggle with the form and append the new
        // scope to the rootScope.

        if (!text) return;

        if (indentStack.length === 0) {
            rootScope.scopes.unshift({
                text: text,
                statements: []
            });
        }
        else {
            rootScope.scopes[0].statements.push(text);
        }
    }
}

/* Add some grammar */

start
  = lines: (line EOL+)*
    { 
        return rootScope;
    }


line
  = line: (samedent t:text { addToRootScope(t); }) &EOL
  / line: (indent t:text { addToRootScope(t); }) &EOL
  / line: (dedent t:text { addToRootScope(t); }) &EOL
  / line: [ \t]* &EOL
  / EOF

samedent
  = i:[\t]* &{ return i.length === indentStack.length; }
    {
        console.log("s:", i.length, " level:", indentStack.length);
    }

indent
  = i:[\t]+ &{ return i.length > indentStack.length; }
    {
        indentStack.push(""); 
        console.log("i:", i.length, " level:", indentStack.length);
    }

dedent
    = i:[\t]* &{ return i.length < indentStack.length; }
      {
          for (var j = 0; j < i.length + 1; j++) {
              indentStack.pop();
          } 
          console.log("d:", i.length + 1, " level:", indentStack.length);  
      }

text
    = numbers: number+  { return numbers.join(""); } 
    / txt: character+   { return txt.join(""); }


number
    = $[0-9] 

character 
    = $[ a-zA-Z->+]  
__
    = [ ]+

_ 
    = [ ]*

EOF 
    = !.

EOL
    = "\r\n" 
    / "\n" 
    / "\r"

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow