Parslet word until delimeter present

Question

TL;DR; If you care what is before the "^" you should parse that first.

--- longer answer ---

A parser will always consume all the text. If it can't consume everything, then the document is not fully described by the grammar. Rather than thinking of it as something performing "splits" on your text... instead think of it as a clever state machine consuming a stream of text.

So... as your full grammar needs to consume all the document... when developing your parser, you can't make it to parse some part and leave the rest. You want it to transform your document into a tree so you can manipulate it into it's final from.

If you really wanted to just consume all text before a delimiter, then you could do something like this...

Say I was going to parse a '^' separated list of things.

I could have the following rules

rule(:thing) { (str("^").absent? >> any).repeat(1) }  # anything that's not a ^
rule(:list)  { thing >> ( str("^") >> thing).repeat(0) } #^ separated list of things

This would work as follows

parse("thing1^thing2") #=> "thing1^thing2"
parse("thing1") #=> "thing1"
parse("thing1^") #=> ERROR ... nothing after the ^ there should be a 'thing'

This would mean list would match a string that doesn't end or start with an '^'. To be useful however I need to pull out the bits that are the values with the "as" keyword

rule(:thing) { (str("^").absent? >> any).repeat(1).as(:thing) }
rule(:list)  { thing >> ( str("^") >> thing).repeat(0) }

Now when list matches a string I get an array of hashes of "things".

parse("thing1^thing2") #=> [ {:thing=>"thing1"@0} , {:thing=>"thing2"@7} ]

In reality however you probably care what a 'thing' is... not just anything will go there.

In that case.. you should start by defining those rules... because you don't want to use the parser to split by "^" then re-parse the strings to work out what they are made of.

For example:

parse("6 + 4 ^ 2") 
 # => [ {:thing=>"6 + 4 "@0}, {:thing=>" 2"@7} ]

And I probably want to ignore the white_space around the "thing"s and I probably want to deal with the 6 the + and the 4 all separately. When I do that I am going to have to throw away my "all things that aren't '^'" rule.