Simple ParseKit grammar for HTML with replacement variables

https://stackoverflow.com/questions/9304318

30-04-2021

Question

For an iOS application, I want to parse an HTML file that may contain UNIX-style variables for replacement. For example, the HTML may look like:

<html>
  <head></head>
  <body>
    <h1>${title}</h1>
    <p>${paragraph1}</p>
    <img src="${image}" />
  </body>
</html>

I'm trying to create a simple ParseKit grammar that gives me two callbacks: one for passthrough HTML and another for each variable it detects. To that end, I wrote the following grammar:

@start        = Empty | content*;

content       = variable | passThrough;
passThrough   = /[^$]+/;
variable      = '$' '{' Word closeChar;

openChar      = '${';
closeChar     = '}';

I'm facing at least two issues with this. First, I had originally declared variable as openChar Word closeChar, but that did not work (I still don't know why). The second and more important issue is that the parser stops when it reaches <img src="${image}" /> (i.e. a variable inside a quoted string).

My questions are:

How can I modify the grammar to make it work as expected?
Is it better to use a tokenizer? If that's the case, how should I configure it?

Solution

Developer of ParseKit here. I'll answer both of your questions:

1) You are taking the correct approach, but this is a tricky case. There are several small gotchas, and your grammar needs to be changed a bit.

I've developed a grammar which is working for me:

// Tokenizer Directives
@symbolState = '"' "'"; // effectively tells the tokenizer to turn off QuoteState. 
                      // Otherwise, variables enclosed in quotes would not be found (they'd be embedded in quoted strings). 
                      // now single- & double-quotes will be recognized as individual symbols, not start- & end-markers for quoted strings

@symbols = '${'; // declare '${' as a multi-char symbol

@reportsWhitespaceTokens = YES; // tell the tokenizer to preserve/report whitespace

// Grammar
@start = content*;
content = passthru | variable;
passthru = /[^$].*/;
variable = start name end;
start = '${';
end = '}';
name = Word;

Then implement these two callbacks in your Assembler:

- (void)parser:(PKParser *)p didMatchName:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);
    PKToken *tok = [a pop];

    NSString *name = tok.stringValue;
    // do something with name
}

- (void)parser:(PKParser *)p didMatchPassthru:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);
    PKToken *tok = [a pop];

    NSMutableString *s = a.target;
    if (!s) {
        s = [NSMutableString string];
    }

    [s appendString:tok.stringValue];

    a.target = s;
}
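
As a rough sketch of the "do something with name" step, the name callback could append a replacement value to the same target string that didMatchPassthru builds. The class name TemplateAssembler and the variables dictionary are hypothetical names introduced only for illustration, and the ParseKit import path is an assumption that may differ in your project. It uses only the pop, stringValue and target calls already shown above.

```objective-c
#import <Foundation/Foundation.h>
#import <ParseKit/ParseKit.h> // assumed umbrella header; adjust to your setup

// Hypothetical assembler class; `variables` is an illustrative name,
// mapping a variable name to its replacement text.
@interface TemplateAssembler : NSObject
@property (nonatomic, copy) NSDictionary *variables;
@end

@implementation TemplateAssembler

- (void)parser:(PKParser *)p didMatchName:(PKAssembly *)a {
    PKToken *tok = [a pop];

    // Look up the replacement; unknown names become an empty string.
    NSString *value = [self.variables objectForKey:tok.stringValue];
    if (!value) {
        value = @"";
    }

    // Same accumulation pattern as in didMatchPassthru.
    NSMutableString *s = a.target;
    if (!s) {
        s = [NSMutableString string];
    }
    [s appendString:value];
    a.target = s;
}

@end
```

The didMatchPassthru callback from the answer would live in the same class unchanged, so passthrough text and replacement values end up in one string in document order.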

And then your client/driver code will look something like this:

NSString *g = // fetch grammar
PKParser *p = [[PKParserFactory factory] parserFromGrammar:g assembler:self];
NSString *s = @"<img src=\"${image}\" />";
NSString *result = [p parse:s];
NSLog(@"result: %@", result);

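This will be printed (the variable is dropped because the didMatchName callback above does not append anything to the target yet):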

result: <img src="" />

2) Yes, for this relatively simple case I think it would be much better to use the tokenizer directly, and it should be considerably faster. Here's how you might approach the task with the tokenizer:

PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
[t setTokenizerState:t.symbolState from:'"' to:'"'];
[t setTokenizerState:t.symbolState from:'\'' to:'\''];
[t.symbolState add:@"${"];
t.whitespaceState.reportsWhitespaceTokens = YES;

NSMutableString *result = [NSMutableString string];

PKToken *eof = [PKToken EOFToken];
PKToken *tok = nil;
while (eof != (tok = [t nextToken])) {
    if ([@"${" isEqualToString:tok.stringValue]) {
        tok = [t nextToken];
        NSString *varName = tok.stringValue;

        // do something with variable
    } else if ([@"}" isEqualToString:tok.stringValue]) {
        // do nothing
    } else {
        [result appendString:tok.stringValue];
    }
}
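
To make the "do something with variable" step concrete, here is a minimal sketch that wraps the loop above in a function and substitutes values from a dictionary. The function name RenderTemplate, its variables parameter and the inVariable flag are hypothetical, and the ParseKit import path is an assumption. The tokenizer calls are the same ones used above. One deliberate difference: a '}' is only swallowed when it closes a variable, so a stray '}' elsewhere in the HTML is passed through.

```objective-c
#import <Foundation/Foundation.h>
#import <ParseKit/ParseKit.h> // assumed umbrella header; adjust to your setup

// Hypothetical helper: replaces ${name} occurrences in `s` with values
// from `variables`; names without an entry are replaced by nothing.
static NSString *RenderTemplate(NSString *s, NSDictionary *variables) {
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
    [t setTokenizerState:t.symbolState from:'"' to:'"'];
    [t setTokenizerState:t.symbolState from:'\'' to:'\''];
    [t.symbolState add:@"${"];
    t.whitespaceState.reportsWhitespaceTokens = YES;

    NSMutableString *result = [NSMutableString string];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    BOOL inVariable = NO; // YES between "${" and its closing "}"

    while (eof != (tok = [t nextToken])) {
        NSString *str = tok.stringValue;
        if ([@"${" isEqualToString:str]) {
            inVariable = YES;
        } else if (inVariable && [@"}" isEqualToString:str]) {
            inVariable = NO;
        } else if (inVariable) {
            // Inside ${...}: treat the token as a variable name.
            NSString *value = [variables objectForKey:str];
            if (value) {
                [result appendString:value];
            }
        } else {
            // Everything else is passthrough HTML.
            [result appendString:str];
        }
    }
    return result;
}
```

A call might look like RenderTemplate(html, values), where html is the template string and values maps names such as image to their replacement text.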

Licensed under: CC-BY-SA with attribution
