Question

I'm trying to write a parser for a syntax highlighter (JSON/HTML/Markdown) as an exercise. I'm not really tokenizing anything; I'm just reading the characters of a given string in a while loop.

Right now, I have a string str that contains the markup, and I can do one of two things:

  1. keep track of the current character's position, increment it as needed, and append the relevant parts to another string final, which will contain the syntax-highlighted output
  2. splice or substring str on each iteration and append the relevant parts to final

Personally, I'd prefer the second option because it allows me to modify str during iteration if necessary, and I only ever need to look at the first character since the string is re-substringed on each iteration. But I know there are downsides, like not being able to look backwards.

What are some measurable downsides that I may be overlooking for each method? And is one generally preferred when writing a parser? (I've seen people do both.)
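For concreteness, a minimal JavaScript sketch of the two approaches (the highlightChar helper is hypothetical, standing in for the real highlighting logic):

```javascript
// Hypothetical helper; the real highlighter logic would go here.
function highlightChar(ch) {
  return ch === '{' || ch === '}' ? `<b>${ch}</b>` : ch;
}

// Option 1: track an index into the original string.
function highlightByIndex(str) {
  let final = '';
  let pos = 0;
  while (pos < str.length) {
    final += highlightChar(str[pos]);
    pos++; // look-behind is str[pos - 1], look-ahead is str[pos + 1]
  }
  return final;
}

// Option 2: re-substring the input on every iteration.
function highlightBySubstring(str) {
  let final = '';
  while (str.length > 0) {
    final += highlightChar(str[0]);
    str = str.substring(1); // O(n) copy per step -> O(n^2) overall
  }
  return final;
}
```

Both produce the same output; the per-iteration copy in option 2 is one of the measurable costs the question is asking about.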

Solution

I've written several JSON parsers (unfortunately, none of them is open source, so I can't point at one). In my opinion, one of the best techniques when writing a JSON parser is to first write a "SAX"-like parser that takes an input string or stream, parses the input, and emits events (for example, to a listener you set):

  • BeginDictionary() (dictionary ~ JSON object, if you prefer that term)
  • BeginDictionaryItem(key, quoted)
  • EndDictionary()
  • BeginList()
  • EndList()
  • Literal(string_value, quoted)

Literal: a quoted or non-quoted string. This SAX parser doesn't have to recognize JSON data types like string, float, or bool; it just recognizes everything as either a quoted or non-quoted string and passes it to your event listener. Sometimes, when you parse a config file, it can come in handy to allow dictionary keys without quotes; the config looks much nicer that way. Allowing comments and a comma after the last dictionary/list item can also be practical, but that is already nonstandard JSON...
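As a sketch of the listener side (the method names and signatures below are assumptions based on the event list above), a listener that simply records every event it receives is handy for testing a parser:

```javascript
// Records every event it receives into an array; one method per event.
function makeRecordingListener() {
  const events = [];
  return {
    events,
    beginDictionary()                { events.push(['beginDictionary']); },
    beginDictionaryItem(key, quoted) { events.push(['beginDictionaryItem', key, quoted]); },
    endDictionary()                  { events.push(['endDictionary']); },
    beginList()                      { events.push(['beginList']); },
    endList()                        { events.push(['endList']); },
    literal(value, quoted)           { events.push(['literal', value, quoted]); },
  };
}
```

Any object with these six methods can be plugged into the parser; the recording variant just makes the event stream inspectable.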

If you want a parser that accepts only standard JSON data types, not all kinds of quoted/non-quoted strings/words, then you can implement a listener that converts "literals" to JSON types and passes them to another kind of listener that accepts only JSON-specific types. One possible optimization here is to have the parser pass only indexes into the input string/stream through its events, if you want to avoid duplication/copying. Optionally, you can write a parser listener implementation that unquotes a string in place, storing it destructively right into the input buffer, if the given platform/language allows that and the parser doesn't look back.
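A sketch of such a converting listener, assuming a downstream listener with a hypothetical typedValue event for the converted values:

```javascript
// Wraps a downstream listener; converts raw (value, quoted) literals
// into JSON values. Quoted literals stay strings; unquoted ones must
// be true/false/null or a number, anything else is a syntax error.
function makeTypeConvertingListener(downstream) {
  return {
    beginDictionary:     (...a) => downstream.beginDictionary(...a),
    beginDictionaryItem: (...a) => downstream.beginDictionaryItem(...a),
    endDictionary:       (...a) => downstream.endDictionary(...a),
    beginList:           (...a) => downstream.beginList(...a),
    endList:             (...a) => downstream.endList(...a),
    literal(value, quoted) {
      if (quoted)            return downstream.typedValue(value);
      if (value === 'true')  return downstream.typedValue(true);
      if (value === 'false') return downstream.typedValue(false);
      if (value === 'null')  return downstream.typedValue(null);
      const n = Number(value);
      if (!Number.isNaN(n))  return downstream.typedValue(n);
      throw new SyntaxError(`not a valid JSON literal: ${value}`);
    },
  };
}
```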

You can also write a parser listener that just stores syntax highlight info, or one that builds an in-memory DOM-like tree from your JSON... That way, you use just one simple-stupid parser that can be plugged together with many different parser-listener implementations for different kinds of tasks. You can also chain listeners together, if you write them to allow this, like: parser -> JSON type converter listener -> JSON schema validator listener -> JSON syntax highlight listener -> JSON DOM builder listener...

And there is no need to separate a tokenizer and a parser in the case of a simple language like JSON; write it as a single recursive descent parser that does everything.
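A minimal sketch of such a single-piece recursive descent parser, assuming the event names above. This is deliberately simplified, not a conforming JSON parser: escape handling keeps only the character after a backslash, numbers/true/false/null are all reported as unquoted literals, and trailing commas are tolerated.

```javascript
// Recursive descent JSON parser that emits events to a listener.
function parseJson(input, listener) {
  let pos = 0;
  const error  = (msg) => { throw new SyntaxError(`${msg} at ${pos}`); };
  const skipWs = () => { while (' \t\r\n'.includes(input[pos])) pos++; };
  const expect = (ch) => { if (input[pos] !== ch) error(`expected '${ch}'`); pos++; };

  function parseString() {            // returns the unquoted content
    expect('"');
    let s = '';
    while (pos < input.length && input[pos] !== '"') {
      if (input[pos] === '\\') pos++; // simplified escapes: keep next char
      s += input[pos++];
    }
    expect('"');
    return s;
  }

  function parseWord() {              // bare literal: number/true/false/null
    const start = pos;
    while (pos < input.length && !' \t\r\n,:{}[]"'.includes(input[pos])) pos++;
    if (pos === start) error('unexpected character');
    return input.slice(start, pos);
  }

  function parseValue() {
    skipWs();
    const ch = input[pos];
    if (ch === '{') {
      pos++;
      listener.beginDictionary();
      skipWs();
      while (input[pos] !== '}') {
        skipWs();
        listener.beginDictionaryItem(parseString(), true);
        skipWs();
        expect(':');
        parseValue();
        skipWs();
        if (input[pos] === ',') { pos++; skipWs(); } // tolerates trailing comma
      }
      pos++;
      listener.endDictionary();
    } else if (ch === '[') {
      pos++;
      listener.beginList();
      skipWs();
      while (input[pos] !== ']') {
        parseValue();
        skipWs();
        if (input[pos] === ',') { pos++; skipWs(); } // tolerates trailing comma
      }
      pos++;
      listener.endList();
    } else if (ch === '"') {
      listener.literal(parseString(), true);
    } else {
      listener.literal(parseWord(), false);
    }
  }

  parseValue();
  skipWs();
  if (pos !== input.length) error('trailing characters');
}
```

Note that the parser itself never builds a tree or a highlighted string; everything downstream of the events is the listener's job.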

If you optimize for memory usage instead, the variable memory requirement of the parser can be as low as one bit per nested JSON object/list (but that version is no longer a recursive descent parser).
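A sketch of the one-bit-per-nesting idea: the only thing an iterative parser must remember per open container is whether it is a dictionary or a list, so it knows which closing bracket to expect. Those bits can be packed into a single integer; using a BigInt as the bit stack here is an assumption for illustration.

```javascript
// Stack of one-bit entries packed into a BigInt:
// 0 = open dictionary, 1 = open list.
function makeNestingStack() {
  let bits = 1n;                 // sentinel high bit marks the bottom
  return {
    push(isList) { bits = (bits << 1n) | (isList ? 1n : 0n); },
    pop()        { const top = bits & 1n; bits >>= 1n; return top === 1n; },
    get depth()  { return bits.toString(2).length - 1; },
  };
}
```

An iterative parser would push on every `{` or `[` and pop on every `}` or `]`, checking that the popped bit matches the closer it just read.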

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow