In this answer I will concentrate on heredocs, because the lessons transfer easily to JCL.
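Any language that supports heredocs with arbitrary terminators is not context-free, so a purely context-free parser, whether generated or written as a plain recursive descent, cannot handle it on its own. We need a way to guide the lexer along more twisted paths, but in doing so, we can maintain the appearance of a context-free language for the parser. All we need is another stack (strictly speaking a first-in, first-out queue) that holds the terminators of the pending heredocs.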
For the parser, we treat introductions to heredocs such as <<END as string literals. But the lexer has to be extended to do the following:
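- When a heredoc introduction is encountered, it appends the terminator to that stack. Because the bodies appear in the same order as their introductions, the entries are later removed first-in, first-out.
- When a newline is encountered, the bodies of all pending heredocs are lexed in that order, until the stack is empty. After that, normal lexing is resumed.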
Take care to update the line number appropriately.
In a hand-written combined parser/lexer, this could be implemented like so:
use strict; use warnings; use 5.010;
my $s = <<'INPUT-END'; pos($s) = 0;
<<A <<B
body 1
A
body 2
B
<<C
body 3
C
INPUT-END
my @strs;
push @strs, parse_line() while pos($s) < length($s);
for my $i (0 .. $#strs) {
    say "STRING $i:";
    print $strs[$i]; # each body already ends with a newline
}
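sub parse_line {
    my @strings;
    my @heredocs;
    $s =~ /\G\s+/gc; # skip leading whitespace
    # get the markers
    while ($s =~ /\G<<(\w+)/gc) {
        push @strings, '';
        # remember where the body goes and which terminator ends it
        push @heredocs, [ \$strings[-1], $1 ];
        $s =~ /\G[^\S\n]+/gc; # spaces that are no newlines
    }
    # lex the EOL
    $s =~ /\G\n/gc or die "Newline expected";
    # process the deferred heredocs in the order they were introduced:
    while (my $heredoc = shift @heredocs) {
        my ($placeholder, $marker) = @$heredoc;
        # match whole lines lazily, so that the first terminator line ends
        # the body (a greedy .* would run on to the last one)
        $s =~ /\G((?:.*\n)*?)\Q$marker\E\n/gc or die "Heredoc <<$marker expected";
        $$placeholder = $1;
    }
    return @strings;
}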
Output:
STRING 0:
body 1
STRING 1:
body 2
STRING 2:
body 3
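The listing above mixes lexing and parsing. To keep the parser unaware of heredocs, the same idea can live entirely in the lexer, which then hands out ordinary string tokens. The following is only a rough sketch under a few assumptions: the function name lex, the token shape [TYPE, value, line] and the WORD token type are made up for illustration, and quoting or indentation variants of heredocs are ignored. It also shows one way to keep the line number correct across the skipped bodies.

```perl
use strict; use warnings; use 5.010;

# Hypothetical helper: turns source text into a flat list of tokens.
# Each token is [TYPE, value, line]; this shape is an assumption of the sketch.
sub lex {
    my ($src) = @_;
    my (@tokens, @pending); # @pending: heredocs whose bodies are still to come
    my $line = 1;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if ($src =~ /\G[^\S\n]+/gc) {
            next; # horizontal whitespace
        }
        elsif ($src =~ /\G<<(\w+)/gc) {
            # the parser sees a plain string token; its value is filled in later
            my $token = [ 'STRING', '', $line ];
            push @tokens, $token;
            push @pending, [ $token, $1 ];
        }
        elsif ($src =~ /\G(\w+)/gc) {
            push @tokens, [ 'WORD', $1, $line ];
        }
        elsif ($src =~ /\G\n/gc) {
            push @tokens, [ 'NEWLINE', "\n", $line ];
            $line++;
            # consume the bodies in the order their introductions appeared
            while (my $heredoc = shift @pending) {
                my ($token, $marker) = @$heredoc;
                $src =~ /\G((?:.*\n)*?)\Q$marker\E\n/gc
                    or die "Line $line: terminator for <<$marker not found\n";
                my $body = $1;
                $token->[1] = $body;
                # skip the body lines plus the terminator line
                $line += ($body =~ tr/\n//) + 1;
            }
        }
        else {
            die "Line $line: unexpected character\n";
        }
    }
    return @tokens;
}

my @tokens = lex("print <<A <<B\none\nA\ntwo\nthree\nB\nprint done\n");
for my $t (@tokens) {
    my ($type, $value, $line) = @$t;
    $value =~ s/\n/\\n/g; # make newlines visible
    say "$line $type '$value'";
}
```

In this run, both STRING tokens are reported on line 1, where their introductions appear, and the second print is reported on line 7, because the lexer counted the five lines it consumed on the parser's behalf.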
The Marpa parser simplifies this a bit by allowing events to be triggered once a certain token is parsed. These are called pauses, because the built-in lexing pauses for a moment so that you can take over. A high-level overview and a short blog post describe this technique, and the demo code is on GitHub.