Question

I have this Tcl8.5 code:

set regexp_str {^[[:blank:]]*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec[[:blank:]]*$}

set subject {
[  5]  0.0- 1.0 sec    680 KBytes  5.57 Mbits/sec
[  5]  0.0-150.0 sec    153 MBytes  8.56 Mbits/sec
[  4]  0.0- 1.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  0.0-150.4 sec  38.6 MBytes  2.15 Mbits/sec
}
set matches [regexp -line -inline -all -- $regexp_str $subject]

$matches populates with the matched data on one machine, while the other simply gets an empty list.
Both machines have Tcl8.5.

Using the -about flag of regexp, the following list is returned: 3 {REG_UUNPORT REG_ULOCALE}

I don't understand how this is possible. What else should I do to debug it?


Edit #1, 17 Feb 07:00 UTC:

@Donal Fellows:
The patch level on the "good" machine is 8.5.15.
The patch level on the "bad" machine is 8.5.10.

I'm familiar with \s and \d, but as far as I know (please correct me if I'm wrong), they both cover a broader range of characters than I need:
\s includes newlines, which mustn't appear at that point in my example.
\d includes Unicode digits, which I will not encounter in my example.
In regular expressions I generally prefer to be as specific as possible, to avoid cases I didn't think of.


There's something which I didn't specify and could be important:
The variable $subject is populated from the expect_out(buffer) variable, following a grep command executed in the shell. expect_out(buffer) holds the output of an ssh session that is tunneled through a proxy called netcat (binary name nc):

spawn ssh -o "ProxyCommand nc %h %p" "$username@$ipAddress"

In general, the data sent and received on this session consists only of ASCII/English characters. The prompt of the destination PC contains control characters such as ESC and BEL, and they end up in $subject. I don't think that is a problem, because I tested the regular expression with all of these characters and it worked OK.

Thank you guys for the detailed info!


Edit #2, 17 Feb 11:05 UTC:

Response to @Donal Fellows:
Indeed I've tried:

set regexp_str {^[[:blank:]]*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec[[:blank:]]*$}
puts [regexp -line -inline -all -- $regexp_str [string map {\r\n \n \r \n} $subject]]

and got (please ignore the different numbers in the output, the idea is the same):

{[  5]  0.0-150.0 sec  86.7 MBytes  4.85 Mbits/sec} 150.0 4.85 M {[  4]  0.0-150.8 sec  60.4 MBytes  3.36 Mbits/sec} 150.8 3.36 M

I also tried replacing the [[:blank:]] at both ends of the regexp string with \s:

set regexp_str {^\s*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec\s*$}
puts [regexp -line -inline -all -- $regexp_str $subject]

and it finally found what I needed:

{[  5]  0.0-150.0 sec  86.7 MBytes  4.85 Mbits/sec
} 150.0 4.85 M {[  4]  0.0-150.8 sec  60.4 MBytes  3.36 Mbits/sec
} 150.8 3.36 M

Solution

Tcl uses the same regular expression engine on all platforms. (But double-check whether you've got the same patchlevel on the two machines; that will let us examine what code changes, if any, there are between the two systems.) It also shouldn't be anything related to newline terminators; Tcl automatically normalizes them under anything even remotely resembling normal circumstances (and in particular, does so in scripts).
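
A quick way to compare the two installations is to print the interpreter details on each machine. This is only a sketch of the kind of information worth comparing; which of these values actually differ on your systems is unknown to me:

```tcl
# Run this on both machines and compare the output line by line.
puts "patchlevel:      [info patchlevel]"
puts "system encoding: [encoding system]"
puts "platform:        $::tcl_platform(platform) / $::tcl_platform(os)"
```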

With respect to the -about flags, only the 3 is useful (it's the number of capture groups). The other item in the list is the set of state flags set about the RE by the RE compiler, and frankly they're only useful to real RE experts (and our test suite). I've never found a use for them!
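
If you only want the group count, you can pick it out of that list. A minimal sketch, assuming the -about switch behaves as shown in the question; the pattern is a shortened, illustrative one rather than your full RE:

```tcl
# Shortened illustrative pattern with three capture groups.
set re {([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec}

# -about compiles the RE and reports on it instead of matching;
# the subject string is not used for matching here, so an empty one will do.
set about [regexp -about -- $re {}]
set groupCount [lindex $about 0]

puts "capture groups: $groupCount"
```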


You can probably shorten your RE by using \s (mnemonically “spaces”) instead of that cumbersome [[:blank:]] and \d (“digits”) instead of [0-9]. When I do that, I get something quite a lot shorter and so easier to understand.

set regexp_str {^\s*\[\s*\d+\]\s+0\.0-(\d+\.\d+) sec.+(\d+\.\d+) ([MK]?)bits/sec\s*$}

It produces the same match groups.
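
One caveat: \s and [[:blank:]] are not strictly interchangeable. \s also matches newlines and carriage returns, while [[:blank:]] only covers space and tab. A small sketch with a made-up test string shows the difference:

```tcl
# Made-up sample: tab, space, carriage return, line feed.
set sample "\t \r\n"

# Without -line, ^ and $ anchor to the start and end of the whole string.
set blankOnly [regexp {^[[:blank:]]+$} $sample]
set anySpace  [regexp {^\s+$} $sample]

puts "\[\[:blank:\]\] matches whole string: $blankOnly"
puts "\\s matches whole string:          $anySpace"
```

For the data pasted in the question this makes no difference, but it can matter when the line endings are not what you expect.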


[EDIT]: Even with the exact version of the code you report, checked out directly from the source code repository tag that was used to build the 8.5.10 distribution, I can't reproduce your problem. However, the fact that the string comes from an Expect buffer is very helpful: the line separator may well not be a plain newline but something else (CRLF, i.e. \r\n, is the number one suspect, but a bare carriage return is also possible). Expect is definitely not the same as normal I/O, for various reasons (in particular, exact byte sequences are often needed in terminal handling).

The easiest thing might be to manually standardize the line separators before feeding the string into regexp. (This won't affect the string in the buffer; it copies, as usual with Tcl.)

regexp -line -inline -all -- $regexp_str [string map {\r\n \n \r \n} $subject]
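
To see why this matters, here is a small sketch with a made-up buffer that uses CRLF line endings, as a terminal session often delivers them (the sample data is an assumption, not your real buffer). With -line, $ only matches right before a \n, and [[:blank:]] does not match \r, so nothing matches until the separators are normalized:

```tcl
set regexp_str {^[[:blank:]]*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec[[:blank:]]*$}

# Hypothetical buffer contents with CRLF line endings.
set raw "\[  5\]  0.0-150.0 sec  86.7 MBytes  4.85 Mbits/sec\r\n"
append raw "\[  4\]  0.0-150.8 sec  60.4 MBytes  3.36 Mbits/sec\r\n"

# Same RE, first on the raw buffer, then on a normalized copy.
set before [regexp -line -inline -all -- $regexp_str $raw]
set after  [regexp -line -inline -all -- $regexp_str \
                [string map {\r\n \n \r \n} $raw]]

puts "raw buffer:        [llength $before] elements"
puts "normalized buffer: [llength $after] elements"
```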

It's also possible that there are other, invisible characters in the output. Working out what is really going on can be complex, but in general you can use a regular expression to test this theory by looking to see if the inverse of the set of expected characters is matchable:

regexp {[^\n [:graph:]]} $subject

When I try with what you pasted, that doesn't match (good!). If it does against your real buffer, it gives you a way to hunt the problem.
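
If it does match, a small sketch like the following may help to list which characters are responsible. The sample string is made up; substitute your real buffer:

```tcl
# Made-up sample: a CR before the newline and an ESC-based colour code.
set sample "4.85 Mbits/sec\r\n\u001b\[0m3.36 Mbits/sec\n"

set found [list]
foreach ch [regexp -all -inline -- {[^\n [:graph:]]} $sample] {
    scan $ch %c code                      ;# character -> code point
    lappend found [format "U+%04X" $code]
}
puts $found
```

Anything that shows up here (carriage returns, ESC, BEL and so on) is a character that [[:blank:]] will not skip over.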

Other tips

I saw that you are missing optional space(s) right after the first dash. I inserted those optional spaces in and all is working:

set regexp_str {^[[:blank:]]*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-[[:blank:]]*([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec[[:blank:]]*$}
# missing -->                                                        ^^^^^^^^^^^^

set subject {
[  5]  0.0- 1.0 sec    680 KBytes  5.57 Mbits/sec
[  5]  0.0-150.0 sec    153 MBytes  8.56 Mbits/sec
[  4]  0.0- 1.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  0.0-150.4 sec  38.6 MBytes  2.15 Mbits/sec
}   
set matches [regexp -line -inline -all -- $regexp_str $subject]

puts "\n\n"
foreach {all a b c} $matches {
    puts "- All: >$all<"
    puts "       >$a<"
    puts "       >$b<"
    puts "       >$c<"
}   

Output

- All: >    [  5]  0.0- 1.0 sec    680 KBytes  5.57 Mbits/sec<
       >1.0<
       >5.57<
       >M<
- All: >    [  5]  0.0-150.0 sec    153 MBytes  8.56 Mbits/sec<
       >150.0<
       >8.56<
       >M<
- All: >    [  4]  0.0- 1.0 sec  0.00 Bytes  0.00 bits/sec<
       >1.0<
       >0.00<
       ><
- All: >    [  4]  0.0-150.4 sec  38.6 MBytes  2.15 Mbits/sec<
       >150.4<
       >2.15<
       >M<

Update

When dealing with complex regular expressions, I often break the expression up into several lines and add comments. The following is equivalent to my previous code, but more verbose and easier to troubleshoot. The key is to pass an additional flag to the regexp command: -expanded, which tells regexp to ignore whitespace and comments in the expression.

set regexp_str {
    # Initial blank
    ^[[:blank:]]*

    # Bracket, number, optional spaces, bracket
    \[[[:blank:]]*[0-9]+\]

    # Spaces
    [[:blank:]]+

    # Number, dash, number
    0\.0\-[[:blank:]]*([0-9]+\.[0-9]+)

    # Unwanted stuff
    [[:blank:]]sec.+

    # Final number, plus unit
    ([0-9]+\.[0-9]+)[[:blank:]]([MK]?)bits/sec

    # Trailing spaces
    [[:blank:]]*$
}   

set subject {
[  5]  0.0- 1.0 sec    680 KBytes  5.57 Mbits/sec
[  5]  0.0-150.0 sec    153 MBytes  8.56 Mbits/sec
[  4]  0.0- 1.0 sec  0.00 Bytes  0.00 bits/sec
[  4]  0.0-150.4 sec  38.6 MBytes  2.15 Mbits/sec
}   

set matches [regexp -expanded -line -inline -all -- $regexp_str $subject]

puts "\n\n"
foreach {all a b c} $matches {
    puts "- All: >$all<"
    puts "       >$a<"
    puts "       >$b<"
    puts "       >$c<"
}   

(ETA: the question is about regular expressions, so why am I talking about massaging a string into a list and picking items out of that? See the end of this answer.)

As a workaround, if you don't really need to use a regular expression, this code gives the exact same result:

set result [list]
foreach line [split [string trim $subject] \n] {
    set list [string map {- { } / { }} $line]
    lappend result \
        $line \
        [lindex $list 3] \
        [lindex $list 7] \
        [string map {Mbits M Kbits K bits {}} [lindex $list 8]]
}

The lines aren't strictly well-formed lists because of the brackets, but it does work.

To clarify:

  • the string trim command takes out the newlines before and after the data: they would otherwise yield extra empty elements
  • the split command creates a list of four elements, each corresponding to a line of data
  • the foreach command processes each of those elements
  • the string map command changes each - or / character into a space, essentially making it a (part of a) list item separator
  • the lappend command incrementally builds the result list out of four items per line of data: the items are the whole line, the fourth item in the corresponding list, the eighth item in the corresponding list, and the ninth item in the corresponding list after the string map command has shortened the strings Mbits, Kbits, and bits to M, K, and the empty string, respectively.

The thing is (moderate rant warning): regular expression matching isn't the only tool in the string analysis toolbox, even though it sometimes looks that way. Tcl itself is, among other things, a powerful string and list manipulation language, and usually far more readable than RE. There is also, for instance, scan: the scan expression "[ %*d] %*f- %f sec %*f %*s %f %s" captures the relevant fields out of the data strings (provided they are split into lines and processed separately); all that remains is to look at the last captured string to see if it begins with M, K, or something else (which would be b). This code gives the same result as my solution above and as your example, except that %f converts the fields to floating-point values, so 0.00 comes back as 0.0:

set result [list]
foreach line [split [string trim $subject] \n] {
    scan $line "\[ %*d\] %*f- %f sec %*f %*s %f %s" a b c
    lappend result $line $a $b [string map {its/sec {} Mb M Kb K b {}} $c]
}
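
Since your real buffer also contains the prompt and the echoed command, it may be worth checking scan's return value (the number of conversions performed) and skipping lines that don't fit. A sketch, where the prompt line and the file name are invented for illustration, and the whole line is left out of the result for brevity:

```tcl
# Hypothetical buffer: one prompt/command line followed by two data lines.
set subject {
user@host:~$ grep sec iperf.log
[  5]  0.0-150.0 sec    153 MBytes  8.56 Mbits/sec
[  4]  0.0- 1.0 sec  0.00 Bytes  0.00 bits/sec
}

set result [list]
foreach line [split [string trim $subject] \n] {
    # scan returns how many fields it assigned; anything but 3 means
    # the line does not have the expected shape, so skip it.
    if {[scan $line "\[ %*d\] %*f- %f sec %*f %*s %f %s" a b c] != 3} {
        continue
    }
    lappend result $a $b [string map {its/sec {} Mb M Kb K b {}} $c]
}
puts $result
```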

Regular expressions are very useful, but they are also hard to get right and to debug when they aren't quite right, and even when you've got them right they're still hard to read and, in the long run, to maintain. Since in very many cases they are actually overkill, it makes sense to at least consider if other tools can't do the job instead.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow