Parsing out multiple lines with LPeg in Lua

Question 1

LPeg version:

local lpeg            = require "lpeg"
local lpegmatch       = lpeg.match
local C, Ct, P, R, S  = lpeg.C, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S
local Cg              = lpeg.Cg

local data_to_arrays

do
  local colon    = P":"
  local lbrak    = P"["
  local rbrak    = P"]"
  local digits   = R"09"^1
  local eol      = P"\n\r" + P"\r\n" + P"\n" + P"\r"
  local ws       = S" \t\v"
  local optws    = ws^0
  local getnum   = C(digits) / tonumber * optws
  local start    = lbrak * optws * eol
  local stop     = optws * rbrak
  local line     = optws * digits * colon * optws
                 * getnum * getnum * getnum * getnum
                 * getnum * getnum * getnum * getnum
                 * eol
  local count    = optws * P"Total count:" * optws * getnum * eol
  local inner    = Ct(line^1 * count^-1)
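  -- Alternative: store the total under the key "count" instead of
  -- appending it to the array part (the test loop below checks ar.count).
--local inner    = Ct(line^1 * Cg(count, "count")^-1)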
  local array    = start * inner * stop
  local extract  = Ct((array + 1)^0)

  data_to_arrays = function (data)
    return lpegmatch (extract, data)
  end
end

Note that this works only if each line of the data block contains exactly eight integers. Depending on how well-formed your input is, this may be a curse or a blessing ;-)
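
If the eight-integer restriction turns out to be a curse, one possible relaxation is to accept one or more integers per line. The following is only a minimal sketch under a few assumptions: LPeg is installed, the pattern name `rows` and the sample input are made up for illustration, and only the pieces needed for a single line are rebuilt here rather than the whole extractor.

```lua
local lpeg = require "lpeg"
local C, Ct, P, R, S = lpeg.C, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S

local digits = R"09"^1
local eol    = P"\r\n" + P"\n" + P"\r"
local optws  = S" \t\v"^0
local getnum = C(digits) / tonumber * optws

-- Index, colon, then one or more integers (instead of exactly eight).
local line   = optws * digits * P":" * optws * getnum^1 * eol
local rows   = Ct(line^1)   -- hypothetical name, not used in the code above

local t = lpeg.match(rows, "  0: 1 2 3\n  3: 4 5\n")
-- t is { 1, 2, 3, 4, 5 }
```

The trade-off is that malformed lines with too few or too many values are no longer rejected, so this only makes sense if the input is trusted.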

And a test file:

data = [[
some text
[    
some text
         [
                  0: 0 0 0 0 0 0 0 0 
                  8: 0 0 0 0 0 0 0 0 
                 16: 0 0 0 9 343 3938 9433 8756 
                 24: 6270 4472 3182 2503 1768 1140 836 496 
                 32: 326 273 349 269 144 121 94 82 
                 40: 64 80 66 59 56 47 50 46 
                 48: 64 35 42 53 42 40 41 34 
                 56: 35 41 39 39 47 30 30 39 
                 Total count: 12345
        ]
    some text
]
some text
[
 some text
   [
              0: 0 0 0 0 0 0 0 0 
              8: 0 0 0 0 0 0 0 0 
             16: 0 0 0 4 212 3079 8890 8941 
             24: 6177 4359 3625 2420 1639 974 594 438 
             32: 323 286 318 296 206 132 96 85 
             40: 65 73 62 53 47 55 49 52 
             48: 29 44 44 41 43 36 50 36 
             56: 40 30 29 40 35 30 25 31 
             64: 47 31 25 29 24 30 35 31 
             72: 28 31 17 37 35 30 20 33 
             80: 28 20 37 25 21 23 25 36 
             88: 27 35 22 23 15 24 34 28 
    ]
    some text
some text
]
]]

local arrays = data_to_arrays (data)

for n = 1, #arrays do
  local ar   = arrays[n]
  local size = #ar
  io.write (string.format ("[%d] = { --[[size: %d items]]\n  ", n, size))
  for i = 1, size do
    io.write (string.format ("%d,%s", ar[i], (i % 5 == 0) and "\n  " or " "))
  end
  if ar.count ~= nil then
    io.write (string.format ("\n  [\"count\"] = %d,", ar.count))
  end
  io.write (string.format ("\n}\n"))
end

Question 2

Try this code, which does not use LPeg:

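-- assume T contains the text
local a={}
local i=0
for b in T:gmatch("%b[]") do            -- each balanced [...] block
        local s=b:gsub("%d+:","")       -- drop the "NN:" line indexes
        i=i+1
        local t={}
        local j=0
        for n in s:gmatch("%d+") do     -- collect the remaining integers
                j=j+1; t[j]=tonumber(n)
        end
        a[i]=t
end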
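
To see how this behaves with nested brackets, here is a small self-contained sketch of the same idea. The helper name `bracket_numbers` and the sample string are made up for illustration; the logic is the loop above wrapped in a function.

```lua
-- Hypothetical helper wrapping the loop above; the name is made up.
local function bracket_numbers(text)
  local arrays = {}
  for block in text:gmatch("%b[]") do
    -- %b[] yields the outermost balanced [...] span, nested brackets included.
    local body = block:gsub("%d+:", "")   -- drop the "NN:" line indexes
    local nums = {}
    for n in body:gmatch("%d+") do
      nums[#nums + 1] = tonumber(n)
    end
    arrays[#arrays + 1] = nums
  end
  return arrays
end

local sample = "text [ more [ 0: 1 2\n 8: 3 4 ] text ] text [ [ 0: 5 ] ]"
local result = bracket_numbers(sample)
-- result[1] is { 1, 2, 3, 4 } and result[2] is { 5 }
```

Because each match is the outermost bracket pair, any digits in free text between the outer and inner brackets would be collected as well, so this approach fits best when that surrounding text contains no numbers.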

Question 3

My pure Lua string library solution would be something like this:

-- assume file_content holds the input text
local bracket_pattern = "%b[]" --pattern for getting into brackets
local number_pattern = "(%d+)%s+" --pattern for parsing numbers
local output_array = {} --output 2-dimensional array
local i = 1

for tmp_sub_str in file_content:gmatch(bracket_pattern) do --iterating through [string]
    table.insert(output_array, i, {}) --adding new [string] group
    for tmp_number in tmp_sub_str:gmatch(number_pattern) do --iterating through numberWHITESPACE
        table.insert(output_array[i], tonumber(tmp_number)) --adding [string] group element (number)
    end
    i = i + 1
end

EDIT: This also works properly with the updated file format.
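
The reason the line indexes drop out is that number_pattern only matches digits followed by whitespace. A minimal sketch of that behaviour, with a made-up helper `collect` and made-up sample strings:

```lua
local number_pattern = "(%d+)%s+"

-- Hypothetical helper for illustration: collect every match of number_pattern.
local function collect(text)
  local found = {}
  for n in text:gmatch(number_pattern) do
    found[#found + 1] = tonumber(n)
  end
  return found
end

-- "16" is followed by ":" rather than whitespace, so it is skipped.
local with_index = collect("[ 16: 0 9 343 \n Total count: 12345\n]")
-- with_index is { 0, 9, 343, 12345 }

-- A number that touches the closing bracket has no trailing whitespace.
local tight = collect("[1 2 3]")
-- tight is { 1, 2 }
```

The second call shows the flip side: a number written directly before the closing bracket is missed, so the pattern relies on every value being followed by a space or line break, as in the test data.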

Question 4

phg already provided a nice LPeg solution for your question, but here's another one using LPeg's re module. Its syntax is closer to BNF and its operators are more regex-like, so this solution may be easier to grok.

local re = require 're'

function dump(t)
  io.write '{'
  for _, v in ipairs(t) do
    io.write(v, ',')
  end
  io.write '}\n'
end

local textformat = [[
  data_in   <-  block+
  block     <-  text '[' block_content ']'
  block_content <- {| data_arr |} / (block / text)*
  data_arr  <- (text ':' nums whitesp)+
  text      <- whitesp [%w' ']+ whitesp
  nums      <- (' '+ {digits} -> tonumber)+
  digits    <- %d+
  whitesp   <- %s*
]]
local parser = re.compile(textformat, {tonumber = tonumber})
local arr1, arr2 = parser:match(data)

dump(arr1)
dump(arr2)

Each data-array block is captured into its own table, and match returns those tables as separate values.

With data being the same input as above, two blocks are matched, so two tables are returned. Inspecting these two tables gives:

{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,343,3938,9433,8756,6270,4472,3182,2503,1768,1140,836,496,326,273,349,269,144,121,94,82,64,80,66,59,56,47,50,46,64,35,42,53,42,40,41,34,35,41,39,39,47,30,30,39,12345,}
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,212,3079,8890,8941,6177,4359,3625,2420,1639,974,594,438,323,286,318,296,206,132,96,85,65,73,62,53,47,55,49,52,29,44,44,41,43,36,50,36,40,30,29,40,35,30,25,31,47,31,25,29,24,30,35,31,28,31,17,37,35,30,20,33,28,20,37,25,21,23,25,36,27,35,22,23,15,24,34,28,}
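
The "one table per block, one return value per table" behaviour is easier to see on a toy grammar. The sketch below is not the grammar above: the rule names and input are made up for illustration, and it assumes the re module that ships with LPeg is available.

```lua
local re = require "re"   -- the re module ships with LPeg

-- Toy grammar, made up for illustration: each [...] group becomes one table.
local groups = re.compile([=[
  groups <- group+
  group  <- %s* '[' {| item+ |} %s* ']'
  item   <- %s* {%d+} -> tonumber
]=], { tonumber = tonumber })

-- Every {| ... |} table capture is returned as a separate value.
local first, second = groups:match("[1 2] [3 4 5]")
-- first is { 1, 2 } and second is { 3, 4, 5 }
```

Here `{%d+} -> tonumber` converts each captured digit string to a number, and `{| ... |}` gathers the captures made inside it into a fresh table.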

Question 5

I know this is a late reply, but with a much smaller grammar the following pattern finds an opening [ and captures every number that is not followed by : until a closing ] is reached. It then repeats the whole block until nothing more matches.

local patt = re.compile([=[
    data    <- {| block |}+
    block   <- ('[' ((%d+ ':') / { %d+ } -> int / [^]%d]+)+ ']') / ([^[]+ block)
]=], { int = tonumber })

You can collect all the captured arrays in a single table with something like this:

local a = { patt:match[=[ ... ]=] }
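
As a concrete usage sketch, here is the same pattern applied to a small made-up input. The grammar is repeated so that the snippet runs on its own; the sample string is an assumption for illustration, and the re module from LPeg is assumed to be installed.

```lua
local re = require "re"   -- ships with LPeg

-- Same grammar as above, repeated so that the snippet runs on its own.
local patt = re.compile([=[
    data    <- {| block |}+
    block   <- ('[' ((%d+ ':') / { %d+ } -> int / [^]%d]+)+ ']') / ([^[]+ block)
]=], { int = tonumber })

-- Made-up sample input with one nested and one plain bracket pair.
local sample = "text [ text [ 0: 1 2\n 8: 3 4 ] ] text [ 0: 5 6 ]"

-- The table constructor collects every table that match returns.
local a = { patt:match(sample) }
-- a[1] is { 1, 2, 3, 4 } and a[2] is { 5, 6 }
```

Wrapping the call in a table constructor is convenient when the number of blocks is not known in advance, since `#a` then gives the block count.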