When using pattern matching in Lua with parentheses, how does one use "%2" to get the capture group?

StackOverflow https://stackoverflow.com/questions/17863722

Question

I am trying to parse a text file and convert it to a table (or JSON) using Lua. An example test file is as follows:

ipv4     2 tcp      6 3598 ESTABLISHED src=192.168.1.117 dst=137.194.2.78 sport=59078 dport=80 packets=4 bytes=298 src=137.194.2.78 dst=132.227.127.212 sport=80 dport=59078 packets=3 bytes=567 [ASSURED] mark=0 use=2
ipv4     2 udp      17 55 src=192.168.1.117 dst=157.56.149.60 sport=49991 dport=3544 packets=5 bytes=445 [UNREPLIED] src=157.56.149.60 dst=132.227.127.212 sport=3544 dport=49991 packets=0 bytes=0 mark=0 use=2
ipv4     2 tcp      6 3420 ESTABLISHED src=192.168.1.104 dst=193.51.224.187 sport=35918 dport=443 packets=19 bytes=2521 src=193.51.224.187 dst=132.227.127.212 sport=443 dport=35918 packets=16 bytes=9895 [ASSURED] mark=0 use=2
ipv4     2 udp      17 59 src=192.168.1.117 dst=192.168.1.255 sport=17500 dport=17500 packets=139 bytes=23908 [UNREPLIED] src=192.168.1.255 dst=192.168.1.117 sport=17500 dport=17500 packets=0 bytes=0 mark=0 use=2
...

Notice that the data in each line can be split into two parts based on the direction (forward and reverse path flows). In case you have a linux system/openwrt router, you may get a similar test file using the conntrack command, or by reading /proc/net/nf_conntrack.

What I wish to retrieve is the following information:

{ 1:
    {
    "bytes":    298,
    "src":      "192.168.1.117",
    "sport":    59078,
    "layer4":   "tcp",
    "dst":      "137.194.2.78",
    "dport":    80,
    "layer3":   "ipv4",
    "packets":  4,
    "rbytes":   567,
    "rpackets": 3
    },
{ 2: ...

where rbytes, rpackets are for the bytes and packets in the reverse direction (second half of line 1 in my example text file).

My parser is as follows:*

function conntrack(callback)
    local connt = {}
    if io.open("conntrack.temp", "r") then

        for line in io.lines("conntrack.temp") do
            line = line:match("^(.-( [^ =]+=).-)%2")
            local entry, flags = _parse_mixed_record(line, " +")

            if flags[6] ~= "TIME_WAIT" then
                entry.layer3 = flags[1]
                entry.layer4 = flags[3]
                for i=1, #entry do
                    entry[i] = nil
                end
                if callback then
                    callback(entry)
                else
                    connt[#connt+1] = entry
                end
            end
        end
    else
        return nil
    end
    return connt
end

function _parse_mixed_record(cnt, delimiter)
    delimiter = delimiter or "  "
    local data = {}
    local flags = {}

    for i, l in pairs(cnt:split("\n")) do
        for j, f in pairs(l:split(delimiter)) do
            local k, x, v = f:match('([^%s][^:=]*) *([:=]*) *"*([^\n"]*)"*')
            if k then
                if x == "" then
                    table.insert(flags, k)
                else
                    data[k] = v
                end
            end
        end
    end

    return data, flags
end

Calling the above function (after including a simple split method in the code), I can parse only the first half of each line. So basically, no rbytes or rpackets are parsed. I know the code responsible for this is

line = line:match("^(.-( [^ =]+=).-)%2")

A print(line) statement following this line in code shows me:

ipv4 2 tcp 6 3598 ESTABLISHED src=192.168.1.117 dst=137.194.2.78 sport=59078 dport=80 packets=4 bytes=298

So the statement truncates each line of the file using a confusing pattern, which I partly understand after playing around with it a bit. The part I still don't get is the %2 that occurs after the captures. I know it is used to somehow access a captured substring, but how should I change this statement so that line contains both the forward path bytes and packet count, as well as the reverse path? My main question is: what exactly is the pattern in this statement? I'm probably going to remove this line to parse the whole line, but I wanted to understand why the original coders are doing this.

I've been through the lua pattern matching manual but I'm still confused on capturing output with %<some_number>. Why doesn't %1 or %3 work?

Two relevant stackoverflow questions I found: Q1, Q2. A deeper explanation would be appreciated.

Also, currently I can't recover the timeout value (the 5th word in line 1, 3598) or the connection state (ESTABLISHED, [ASSURED]) with the code I have provided here. I'm still a beginner at Lua and hope to crack this soon.

*NOTE: This parser is my fixed version of the one available in the luci sys module on openwrt routers. See original luci.sys sourcecode for details.

While working on Attitude Adjustment 12.09, I noticed that its net.conntrack() isn't working because it fails to parse the object into a proper JSON format. The relevant functions using this pattern are in the sys.lua file: function conntrack(callback) and the internal function _parse_mixed_record(cnt, delimiter). My router used luci-0.11 and Lua 5.1.4.


Solution

That pattern was designed to keep only the forward part of each line. Here's how it does that. The second pair of parentheses, ( [^ =]+=), captures the first substring of the form " stuff=". Then the %2 at the end of the pattern will only match if that same string, " stuff=", appears again. So on a line like

ipv4     2 tcp      6 3598 ESTABLISHED src=192.168.1.117 dst=137.194.2.78 sport=59078 dport=80 packets=4 bytes=298 src=137.194.2.78 dst=132.227.127.212 sport=80 dport=59078 packets=3 bytes=567 [ASSURED] mark=0 use=2

the second capture will be " src=", so the first capture, which is what is assigned to line, will be the whole initial portion of the line until just before the second time src= appears, that is, this initial part:

ipv4     2 tcp      6 3598 ESTABLISHED src=192.168.1.117 dst=137.194.2.78 sport=59078 dport=80 packets=4 bytes=298
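To see the back-reference in isolation, here is a minimal sketch that should run in a plain Lua interpreter. The sample string and variable names are made up for illustration. It also shows what happens when %2 is swapped for %1 or %3 in this particular pattern.

```lua
-- Hypothetical sample: the key "a" appears twice, like "src" in a conntrack line.
local sample = "x a=1 b=2 a=3 b=4"

-- Capture 1 is the outer group, capture 2 the inner " key=" group.
-- %2 must then match the same text that capture 2 matched.
local first, key = sample:match("^(.-( [^ =]+=).-)%2")
print(first)  -- everything before the second " a="
print(key)    -- the repeated text itself, here " a="

-- %1 is legal at this position because capture 1 is already closed, but it
-- asks for the whole first capture to repeat, so match returns nil.
local with_1 = sample:match("^(.-( [^ =]+=).-)%1")
print(with_1)

-- %3 refers to a capture that does not exist in this pattern, so Lua raises
-- an error; pcall turns that error into a false status.
local ok = pcall(string.match, sample, "^(.-( [^ =]+=).-)%3")
print(ok)
```

In other words, %n does not return a capture; it only requires the text of capture n to occur again at that point in the subject string.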

If you wanted to get the second half too, and assign it to a different variable, you could replace the line = ... statement with

line1, _, line2 = line:match("^(.-( [^ =]+=).-)(%2.*)$")

This would assign to line1 the first half of the line (as was previously assigned to line), and to line2 the remainder, starting from the second occurrence of " src=". For the example line above, you'd get

line1 = "ipv4     2 tcp      6 3598 ESTABLISHED src=192.168.1.117 dst=137.194.2.78 sport=59078 dport=80 packets=4 bytes=298"
line2 = " src=137.194.2.78 dst=132.227.127.212 sport=80 dport=59078 packets=3 bytes=567 [ASSURED] mark=0 use=2"

Note: The _ in between line1 and line2 is there to catch the second capture (which here is the string " src="). Remember that match returns all captures, in order, whether you want them or not.
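As a rough sketch of how the two halves could be turned into the table the question asks for, the following uses the three-capture pattern together with gmatch. The helper name parse_conntrack_line is hypothetical and not part of luci.sys, and the choice to keep only packets and bytes from the reverse direction is an assumption based on the desired output shown in the question.

```lua
-- Hypothetical helper, not part of luci.sys: turn one conntrack line into a table.
local function parse_conntrack_line(line)
  -- Split at the second occurrence of the first " key=" (normally " src=").
  local forward, _, reverse = line:match("^(.-( [^ =]+=).-)(%2.*)$")
  if not forward then
    return nil  -- no repeated key, so there is nothing to split
  end

  local entry = {}
  -- The first and third words are the layer 3 and layer 4 protocol names.
  entry.layer3, entry.layer4 = line:match("^(%S+)%s+%S+%s+(%S+)")

  -- Forward direction: keep every key=value pair under its own name.
  for k, v in forward:gmatch("([^ =]+)=(%S+)") do
    entry[k] = tonumber(v) or v  -- addresses are not numbers and stay strings
  end
  -- Reverse direction: keep only the counters, prefixed with "r".
  for k, v in reverse:gmatch("([^ =]+)=(%S+)") do
    if k == "packets" or k == "bytes" then
      entry["r" .. k] = tonumber(v)
    end
  end
  return entry
end

local line = "ipv4     2 tcp      6 3598 ESTABLISHED src=192.168.1.117 dst=137.194.2.78 sport=59078 dport=80 packets=4 bytes=298 src=137.194.2.78 dst=132.227.127.212 sport=80 dport=59078 packets=3 bytes=567 [ASSURED] mark=0 use=2"
local e = parse_conntrack_line(line)
print(e.src, e.bytes, e.rpackets, e.rbytes)
```

Lines without a repeated key make the helper return nil, so a caller reading a real conntrack file would need to check for that before using the result.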

Licensed under: CC-BY-SA with attribution