Question

I'm working on parsers that not only process delimited content, but also handle escape sequences within certain portions of that content. I'm weighing the efficiency of several approaches to tokenization and would appreciate thoughts on the following.

To use JSON as but one example, there are a few ways to process the following:

["Foo\n"]

Consider that the input stream might be multi-megabyte, with lots of varied escaped/un-escaped sequences.

Approaches:

By Character

The first option is simply to tokenize by character:

[array string #"F" #"o" #"o" #"^/" /string /array]

Pro: Uniform, quick to implement, works well with streamed content

Con: Not at all efficient, as you invoke the token handler for nearly as many tokens as there are characters in the input stream.
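
For reference, a minimal sketch of what such a character-level rule might look like in the parse dialect (emit is the token handler assumed throughout; the plain charset and the single \n escape are simplifying assumptions):

plain: complement charset {\"}    ; anything that is neither a backslash nor a quote

char-string: [
    {"} (emit 'string)
    any [
          set c plain (emit c)      ; one token per plain character
        | "\n" (emit #"^/")         ; each escape decoded to a single character token
    ]
    {"} (emit /string)
]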

By Escaped/Un-Escaped Tokens

A somewhat more efficient tokenizer might result in:

[array string "Foo" "^/" /string /array]

Pro: Somewhat more efficient, quick to implement

Con: There are still a lot of tokens for heavily escaped content, and a consumer can't tell from two adjacent tokens whether they represent one item or two
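
Again just as a sketch, the rule for this variant might differ only in copying whole runs of un-escaped characters (plain is the same assumed charset as above):

plain: complement charset {\"}

chunk-string: [
    {"} (emit 'string)
    any [
          copy part some plain (emit part)    ; a run of un-escaped characters as one token
        | "\n" (emit "^/")                    ; each escape decoded as its own token
    ]
    {"} (emit /string)
]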

By Whole Tokens

A minimal tokenizer might produce the following:

[array "Foo^/" /array]

Pro: Far fewer tokens to handle

Con: This raises many questions, chief amongst them: how is the string "Foo^/" created? To break this down, I'll consider two sub-approaches:

Match the sequence, then resolve escapes:

This might be handled thus:

[
      "[" (emit 'array)
    | "]" (emit /array)
    | {"} copy value string-sequence {"} (emit de-escape value)
]

Pros: Quickly identify matches, use and modify a single string

Cons: This is effectively a two-pass process. There may end up being two separate rules that match escape sequences, one in string-sequence and one in de-escape, and there's extra effort in keeping them consistent.
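
To make that concrete, here is roughly what the two pieces might look like; only the names string-sequence and de-escape come from the rule above, their bodies are assumptions:

plain: complement charset {\"}

; pass 1: recognize the whole literal, treating any backslash pair as opaque
string-sequence: [any ["\" skip | some plain]]

; pass 2: rewrite the escapes that pass 1 skipped over; both passes must agree on what counts as an escape
de-escape: func [value [string!]] [
    replace/all value "\n" "^/"
    value
]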

Match portions of the string and append to a buffer:

This might look like:

[
      "[" (emit 'array)
    | "]" (emit /array)
    | {"} (buffer: make string! "") some [
          copy part unescaped-sequence (append buffer part)
        | "\n" (append buffer "^/")
    ] {"} (emit buffer)
]

Pros: One pass

Cons: Now we're back to handling chunks much as in the 'By Escaped/Un-Escaped Tokens' method, and we're managing an additional value buffer.
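
(For completeness, unescaped-sequence in the rule above could be as simple as the following, mirroring the charset idea used earlier; each additional JSON escape such as \t, \", \\ or \uXXXX would then become one more alternative alongside the "\n" branch.)

plain: complement charset {\"}    ; neither a backslash nor a quote
unescaped-sequence: [some plain]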


Solution

Most parsers decode string escapes during tokenization. Note that the tokenizer needs to be aware of escapes anyway in order to find the end of a string, since the string delimiter " itself may be escaped. That means it is not fundamentally more complicated to also decode the escapes in the same pass.
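
As a rough illustration of that point (a sketch, not code from the question), the same alternation that decodes the content is the one that decides whether a quote ends the string:

plain: complement charset {\"}

json-string: [
    {"} (buffer: copy "")
    any [
          copy part some plain (append buffer part)
        | {\"} (append buffer {"})    ; an escaped quote is content, not the end of the string
        | "\n" (append buffer "^/")
    ]
    {"} (emit buffer)
]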

It is not worse to first recognize the whole string literal and then process escapes. In low-level languages (like C), a two-phase approach would let you know the exact size of the decoded string before you process the escapes, but that argument does not apply in your case. It may still be a good idea if it turns out to be simpler, but your de-escape function will amount to something like the single-pass tokenizer anyway.

A tokenizer should produce one token per literal, because that makes the token stream much easier for consumers to handle. In particular, a string literal split into multiple tokens would have to be concatenated into a single string at some point anyway, so we might as well do that right away.

A notable scenario where that does not hold is complicated languages with string “literals” that allow interpolation: tokenizing those elegantly is challenging, and would likely lead to something like your “By Escaped/Un-Escaped Tokens” suggestion. As another example, the XML document model may contain multiple consecutive text nodes. XML entities are tricky because a tokenizer has to parse the DTD before entities can be expanded, and because eagerly expanding entities opens the program up to denial-of-service attacks.

TL;DR: Prefer your "By Whole Tokens" approach, ideally implemented via your "match portions of the string and append to a buffer" idea.

Licensed under: CC-BY-SA with attribution