Using regular expression to extract a word…if it exists

https://stackoverflow.com/questions/11582057

22-06-2021

Question

I want to use RE to parse a log file and return the orderid if it exists. For example:

Here is a sample log

2012-07-19 12:05:04,288 [22] INFO  AddQueueCommand [(null)] - Status set to Subscribed
2012-07-19 12:05:04,288 [23] INFO  FooBarProviderFactory [(null)] - Missing Function : OrderId:102602 : Method:AddOrderToId : application:11
2012-07-19 12:05:04,288 [22] INFO  AddQueueCommand [(null)] - Status set to Pending
2012-07-19 12:05:04,288 [23] INFO  AddSubscription [(null)] - Subscription Added. OrderId:102603 : application:15
2012-07-19 12:05:04,288 [22] INFO  AddQueueCommand [(null)] - Status set to Subscribed

What I want to do is use a regular expression so I can parse the components of the log message. But when an "OrderId" exists, I want to be able to parse the orderid #.

Here is what I have so far:

^
(?<before>.*)
(?<order>((?<=OrderId\:\s*)\d*))
(?<after>.*)
$

which works great for parsing the orderids for lines that have them, but it fails when the line doesn't have them. I tried adding the "?" zero or one to the order row which then parses all the rows, but never parses the actual orderid. They are always null.

Hope someone can see what I am doing wrong. Thanks!

(I want it to parse every line because I am going to parse multiple id values from each row, and they may or may not exist. I want it to return the value if what I am searching for exists, or null/empty if it doesn't. It needs to return something for every row. This will be plugged into LogParser so we can query our logs for specific orders or other variables.)

Solution

If you make the <order> group optional, the greedy <before> group will always match the entire line, so the match succeeds without capturing an OrderId even when one is present. Making <before> lazy ((?<before>.*?)) won't help in this case either, because then the <after> group matches everything.
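The behaviour described above can be reproduced with a small sketch. It is an adaptation, not the pattern from the question: it uses Python's re module, so named groups are written (?P<name>...), and the variable-width lookbehind is replaced by a plain optional group because Python's re does not accept variable-width lookbehinds. The sample line is taken from the log above; the names greedy and lazy are only illustrative.

```python
import re

line = ("2012-07-19 12:05:04,288 [23] INFO  AddSubscription [(null)] - "
        "Subscription Added. OrderId:102603 : application:15")

# Optional <order> group with a greedy <before>: <before> swallows the whole line.
greedy = re.compile(r"^(?P<before>.*)(?:OrderId:\s*(?P<order>\d+))?(?P<after>.*)$")
# Same pattern with a lazy <before>: now <after> swallows the whole line.
lazy = re.compile(r"^(?P<before>.*?)(?:OrderId:\s*(?P<order>\d+))?(?P<after>.*)$")

greedy_match = greedy.match(line)
lazy_match = lazy.match(line)

print(greedy_match.group("order"))  # None, although the line contains an OrderId
print(lazy_match.group("order"))    # None as well
```

In both cases the overall match succeeds, which is why the engine never has a reason to backtrack and try the optional group at the position where OrderId actually occurs.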

But you can do all that you want in a single regex, if I understand you correctly. For example, assuming you want the word after Status set to (if present) and the number after OrderId: (if present) for each line, then you can use the regex

^
(?(?=.*Status\sset\sto\s)(?=.*Status\sset\sto\s(?<status>\w+))|)
(?(?=.*OrderId:)(?=.*OrderId:(?<order>\d+))|)

on each line and check whether the groups <status> and/or <order> have matched. Expand as necessary.

This assumes that your regex engine supports conditionals, which is the case for .NET, Perl and PCRE.

Explanation:

(?               # Conditional: IF it's possible to match...
 (?=.*OrderId:)  #  any string, followed by "OrderId:"
                 # THEN try to match this:
  (?=            #  Lookahead assertion:
   .*OrderId:    #   any string, followed by "OrderId:" 
   (?<order>\d+) #   followed by a number --> capture in group <order>
  )              #  End of lookahead
 |               # ELSE try to match this:
                 #  the empty string (always succeeds)
)                # End of conditional.

Why two lookaheads right after each other? The regex engine must not consume any characters in the current line: we don't know which order the entries will appear in, so each search needs to start at the beginning of the line. (If, on the other hand, the order of entries is always fixed, then the regex can be simplified a bit.)
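For engines without lookahead conditionals, a similar effect can be sketched another way. Python's re module, for example, only supports conditionals that test a group reference, so the sketch below swaps in a different technique: an optional group inside each lookahead. It is not the .NET regex above, and the names LINE_RE and parse_line are hypothetical. Each lookahead still starts at the beginning of the line and consumes nothing, so the order of the entries does not matter.

```python
import re

# Optional groups inside lookaheads stand in for the conditionals
# (an adaptation for Python's re module, not the .NET regex itself).
LINE_RE = re.compile(
    r"^"
    r"(?=(?:.*Status\sset\sto\s(?P<status>\w+))?)"
    r"(?=(?:.*OrderId:(?P<order>\d+))?)"
)

def parse_line(line):
    """Return the status and order id for one log line; missing values are None."""
    # Both lookaheads can succeed without capturing, so match() never returns None.
    match = LINE_RE.match(line)
    return {"status": match.group("status"), "order": match.group("order")}

sample = [
    "2012-07-19 12:05:04,288 [22] INFO  AddQueueCommand [(null)] - Status set to Subscribed",
    "2012-07-19 12:05:04,288 [23] INFO  AddSubscription [(null)] - Subscription Added. OrderId:102603 : application:15",
]
results = [parse_line(line) for line in sample]
for result in results:
    print(result)
```

Every line yields a result, with None for the values that are absent, which is what the question asks for.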

OTHER TIPS

Use a String Scan Method

If all you want is the OrderId numbers, you can simplify this task by scanning the string for a matching expression. For example, assuming your log data is stored in a String named log, in Ruby you can do the following:

log.scan /OrderId:(\d+)/
=> [["102602"], ["102603"]]

Only the matched text will be stored. No array values will be stored for lines without a match.

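In Python, you can return a list of ids with a list comprehension. Here logs is assumed to be an open file object (or any other iterable of lines). Using re.search instead of re.sub avoids leaving each line's trailing newline in the result:

 >>> import re
 >>> [ m.group(1) for m in ( re.search( r"OrderId:(\d+)", line ) for line in logs ) if m ]
 ['102602', '102603']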

Licensed under: CC-BY-SA with attribution
