Parsing text structured as a tree with fixed-width columns using Parslet in Ruby

https://stackoverflow.com/questions/14741903

07-03-2022

Question

I'm stuck. For a couple of days I've been trying to parse this text (see the bottom of the post), but I can't figure some things out. The text is formatted as a tree with fixed-width columns, but the exact column width depends on the widest field.

I'm using Ruby. First I tried the Treetop gem and made some progress, but then I decided to try Parslet. I'm using it now and it seems like it should be easier, but it's hard to find detailed documentation for it.

Currently I parse each line individually and build an array of parsed entries, but that's not correct because I lose the structure. I need to parse it recursively and keep track of depth.

I would really appreciate any tips, ideas, suggestions.

Here's my current code. It works, but all the data is flattened. My current idea is to parse recursively whenever the current line's start position (i.e. its width) is greater than the previous line's, since that means we should go one level deeper. I actually managed to make it descend that way, but then I couldn't get back out properly, so I removed that code.

require 'pp'
require 'parslet'
require 'parslet/convenience'


class TextParser < Parslet::Parser
    @@width = 5

    root :text

    rule(:text)   { (line >> newline).repeat }

    rule(:line) { left >> ( topline | subline ).as(:entry) }

    rule(:topline) {
        float.as(:number) >> str('%') >> space >> somestring.as(:string1) >> space >> specialstring.as(:string2) >> space >> specialstring.as(:string3)
    }

    rule(:subline) {
        dynamic { |source, context|
            width = context.captures[:width].to_s.length
            width = width-1 if context.captures[:width].to_s[-1] == '|'
            if width > @@width
                # should be recursive
                result = ( specialline | lastline | otherline | empty )
            else
                result = ( specialline | lastline | otherline | empty )
            end
            @@width = width
            result
        }
    }

    rule(:otherline) {
        somestring.as(:string1)
    }

    rule(:specialline) {
        float.as(:number) >> str('%') >> dash >> space? >> specialstring.as(:string1)
    }

    rule(:lastline) {
        float.as(:number) >> str('%') >> dash >> space? >> str('[...]')
    }

    rule(:empty) {
        space?
    }

    rule(:left) {  seperator.capture(:width) >> dash?.capture(:dash) >> space? }

    rule(:somestring) { match['0-9A-Za-z\.\-'].repeat(1) }
    rule(:specialstring) { match['0-9A-Za-z&()*,\.:<>_~'].repeat(1) }

    rule(:space) { match('[ \t]').repeat(1) }
    rule(:space?) { space.maybe }
    rule(:newline) { space? >> match('[\r\n]').repeat(1) }

    rule(:seperator) { space >> (str('|') >> space?).repeat }
    rule(:dash) { space? >> str('-').repeat(1) }
    rule(:dash?) { dash.maybe }

    rule(:float)   { (digits >> str('.') >> digits) }
    rule(:digits)   { match['0-9'].repeat(1) }

end

parser = TextParser.new

file = File.open("text.txt", "rb")
contents = file.read.to_s
file.close

pp parser.parse_with_debug(contents)

The text looks like this (https://gist.github.com/davispuh/4726538):

 1.23%  somestring  specialstring                    specialstring
        |
        --- specialstring
           |          
           |--12.34%-- specialstring
           |          specialstring
           |          |          
           |          |--12.34%-- specialstring
           |          |          specialstring
           |          |          |          
           |          |          |--12.34%-- specialstring
           |          |           --1.12%-- [...]
           |          |          
           |           --2.23%-- specialstring
           |                     |          
           |                     |--12.34%-- specialstring
           |                     |          specialstring
           |                     |          specialstring
           |                     |          |          
           |                     |          |--12.34%-- specialstring
           |                     |          |          specialstring
           |                     |          |          specialstring
           |                     |           --1.23%-- [...]
           |                     |          
           |                      --1.23%-- [...]
           |                                 
            --1.05%-- [...]

 1.23%  somestring  specialstring                    specialstring
 2.34%  somestring  specialstring                    specialstring  
        |
        --- specialstring
            specialstring
            specialstring
           |          
           |--23.34%-- specialstring
           |          specialstring
           |          specialstring
            --34.56%-- [...]

        |
        --- specialstring
            specialstring
           |          
           |--12.34%-- specialstring
           |          |          
           |          |--100.00%-- specialstring
           |          |          specialstring
           |           --0.00%-- [...]
            --23.34%-- [...]

thanks :)

Solution

I was going to say the same thing as "the Tin Man". There has to be another format you can generate the data in.

If you want to parse this, however: Parslet works like a map/reduce algorithm. Your first pass (parsing) is not intended to give you your final output, just to capture all the information you need from your source document.

Once you have that stored in a tree, you can then transform it to get the output you want.

So... I would write a parser that records each whitespace character as a node, as well as matching the text and percentages you need. I would group the whitespace nodes in an "indentation" node.

I would then use a transform to replace each group of whitespace nodes with a count of those nodes, which gives you the indentation of each line.
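To make the counting idea concrete, here is a minimal sketch in plain Ruby with a regular expression rather than a Parslet parser and transform. It only illustrates the "indentation becomes a number" step. The pattern `BRANCH_LINE` and the helper `measure` are hypothetical names, and the pattern assumes that branch lines look like the ones in the sample text (a prefix of spaces and pipes, some dashes, an optional percentage, then the text).

```ruby
# Hypothetical pattern for a branch line in the sample text:
# a prefix of spaces and pipes, one or more dashes, an optional "NN.NN%--",
# then the remaining text.
BRANCH_LINE = /\A(?<indent>[ |]*)-+(?:(?<percent>\d+\.\d+)%-+)?\s*(?<text>.*?)\s*\z/

# Returns a hash with the indentation width (a count of prefix characters),
# the percentage as a Float (or nil if absent), and the text.
# Returns nil for lines that are not branch lines, such as continuation
# lines and blank lines, which this sketch does not handle.
def measure(line)
  m = BRANCH_LINE.match(line)
  return nil unless m

  { indent: m[:indent].length, percent: m[:percent]&.to_f, text: m[:text] }
end

sample = [
  (" " * 8)  + "--- specialstring",
  (" " * 11) + "|--12.34%-- specialstring",
  (" " * 12) + "--1.05%-- [...]"
]

sample.each { |line| p measure(line) }
```

In a Parslet version the same count would come from the length of the captured indentation node. The point is only that every entry ends up carrying a number you can compare against the entries around it.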

Remember: Parslet generates a standard Ruby hash. You can then write whatever code you like to make sense of this tree.

The parser is just converting the text file into a data structure you can manipulate.
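As one example of such post-processing, the sketch below turns a flat list of entries into a nested tree by comparing indentation counts with a stack. It is plain Ruby and makes assumptions that are not in the original: each entry is a hash with an `:indent` number and a `:text` string, a deeper entry always has a strictly larger `:indent` than its parent, and the helper name `nest` is hypothetical.

```ruby
# Builds a nested tree from a flat, ordered list of entries.
# Assumption: an entry is a child of the nearest preceding entry
# whose :indent is strictly smaller than its own.
def nest(entries)
  root  = { indent: -1, children: [] }   # sentinel parent of all top-level entries
  stack = [root]

  entries.each do |entry|
    node = entry.merge(children: [])
    # Climb back out until the top of the stack is a valid parent.
    stack.pop while stack.last[:indent] >= node[:indent]
    stack.last[:children] << node
    stack.push(node)
  end

  root[:children]
end

flat = [
  { indent: 8,  text: "a" },
  { indent: 12, text: "b" },
  { indent: 23, text: "c" },
  { indent: 23, text: "d" },
  { indent: 12, text: "e" }
]

require 'pp'
pp nest(flat)
```

The `pop` loop is what handles "getting back out" of a deeper level: when a shallower entry arrives, the stack unwinds until it reaches an entry that can be its parent.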

Just to reiterate, though: I think "the Tin Man" has the right answer. Generate the data in a machine-readable way instead.

Update:

For an alternative approach you can check out: Indentation sensitive parser using Parslet in Ruby?

Licensed under: CC-BY-SA with attribution
