Question

I am trying to parse the following string, but I can't figure out how to negate a whole word in a regex. I have the following text message history:

string = '2014-03-29 10:29:24 AM: John Doe: Hey dude how are you feeling 2014-03-29 10:30:39 AM: Billy: Hey Doe, Im feeling better now. 2014-03-29 10:30:58 AM: Billy: Yup'

My Ruby regex currently looks like this:

string.scan(/((\d{4}-\d{2}-\d{2}\s+\d{2}\:\d{2}\:\d{2}\s+[AP][M])\:\s(.*?)\:\s([^\d{4}]*))/) {|match| puts match}

Output:
   2014-03-29 10:29:24 AM: John Doe: Hey dude how are you feeling 
   2014-03-29 10:29:24 AM
   John Doe
   Hey dude how are you feeling 
   2014-03-29 10:30:39 AM: Billy: Hey Doe, Im feeling better now. 
   2014-03-29 10:30:39 AM
   Billy
   Hey Doe, Im feeling better now. 
   2014-03-29 10:30:58 AM: Billy: Yup
   2014-03-29 10:30:58 AM
   Billy
   Yup

Problem

My regex negation applies only to individual characters, not to whole words. [^\d{4}] stops as soon as it meets any digit, not only at a four-digit word like '2014'.


Solution

Try the one below. I used ?: to avoid capturing some of the groups in your regex. I also added a positive lookahead, (?=\d{4}-|$), which checks whether the text that follows is in \d\d\d\d- format or whether it is the end of the line. You can make the lookahead stricter if you want (for example, the full yyyy-mm-dd format).

string.scan(/((?:\d{4}-\d{2}-\d{2}\s+\d{2}\:\d{2}\:\d{2}\s+[AP][M])\:\s(?:.*?)\:.*?)(?=\d{4}-|$)/) {|match| puts match}

Output:

2014-03-29 10:29:24 AM: John Doe: Hey dude how are you feeling 
2014-03-29 10:30:39 AM: Billy: Hey Doe, Im feeling better now. 
2014-03-29 10:30:58 AM: Billy: Yup
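
If the three separate fields from the question are still needed, the same lookahead idea can be combined with capturing groups. The following is a minimal sketch, not part of the original answer; the names pattern and messages are only illustrative, and the lookahead is tightened to the full yyyy-mm-dd shape on the assumption that message text never contains such a date.

```ruby
string = '2014-03-29 10:29:24 AM: John Doe: Hey dude how are you feeling 2014-03-29 10:30:39 AM: Billy: Hey Doe, Im feeling better now. 2014-03-29 10:30:58 AM: Billy: Yup'

# Three capture groups: timestamp, sender, message text.
# The lazy message group stops where the lookahead sees the next
# yyyy-mm-dd date or the end of the line.
pattern = /(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+[AP]M):\s(.*?):\s(.*?)(?=\d{4}-\d{2}-\d{2}|$)/

# scan returns one [timestamp, sender, text] array per match.
messages = string.scan(pattern).map do |timestamp, sender, text|
  { timestamp: timestamp, sender: sender, text: text.strip }
end

messages.each { |m| puts "#{m[:timestamp]} | #{m[:sender]} | #{m[:text]}" }
```

Stripping the text removes the trailing space that sits just before the next date.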

OTHER TIPS

You can match everything up to a specific "word" like this; here is an example with the word "2014":

(?>[^2]+|2(?!014))*

The same with an unknown year (four digits followed by a hyphen):

(?>[^0-9]+|[0-9](?![0-9]{3}-))*
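
As a rough illustration of how the unknown-year pattern behaves in Ruby, the sketch below applies it first to a made-up sentence and then inside the question's regex. The sample sentence and the variable names are assumptions added for this example.

```ruby
# Consume non-digits in runs; consume a digit only when it does not
# start a "four digits + hyphen" sequence.
until_year = /(?>[^0-9]+|[0-9](?![0-9]{3}-))*/

sample = 'Room 42 is free until 5 pm 2014-03-29 10:30:39 AM: Billy: Yup'
prefix = sample[until_year]
puts prefix.inspect   # prints "Room 42 is free until 5 pm "

# The same subpattern used as the message group of the original regex.
string = '2014-03-29 10:29:24 AM: John Doe: Hey dude how are you feeling 2014-03-29 10:30:39 AM: Billy: Hey Doe, Im feeling better now. 2014-03-29 10:30:58 AM: Billy: Yup'
pattern = /(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+[AP]M):\s(.*?):\s((?>[^0-9]+|[0-9](?![0-9]{3}-))*)/

rows = string.scan(pattern)
rows.each { |timestamp, sender, text| puts "#{sender} (#{timestamp}): #{text.strip}" }
```

Unlike [^\d{4}], this lets isolated digits such as "42" or "5" through and stops only in front of something that looks like the start of a date.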

Another way is to split the string with a lookahead:

string.split(/(?=\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+[AP][M]:)/)

Note: for these three patterns, you can choose how specific (in length and precision) the subpattern inside the lookahead assertion should be.
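
The split approach can be sketched as follows. This is only an illustration: the names boundary, entries, and fields are invented here, and the second step assumes that sender names never contain ": ".

```ruby
string = '2014-03-29 10:29:24 AM: John Doe: Hey dude how are you feeling 2014-03-29 10:30:39 AM: Billy: Hey Doe, Im feeling better now. 2014-03-29 10:30:58 AM: Billy: Yup'

# Zero-width boundary: split in front of every timestamp without consuming it.
boundary = /(?=\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+[AP]M:)/

entries = string.split(boundary).map(&:strip)
entries.each { |entry| puts entry }

# Each entry can then be cut into timestamp, sender, and text.
# The limit of 3 keeps any later ": " inside the message text.
fields = entries.map { |entry| entry.split(': ', 3) }
fields.each { |timestamp, sender, text| puts "#{sender} wrote #{text.inspect} at #{timestamp}" }
```

Because the colons inside the time are followed by digits rather than a space, splitting on ": " leaves the timestamp intact.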

Character classes don't work like this.

[^\d{4}]*

Meaning:

^    -  Negative class, so -
\d   -  Not digit 0-9
{    -  Not '{'
4    -  Not '4'
}    -  Not '}'

The * then matches zero or more characters from this set.
Therefore, the match stops at the first digit or brace it meets; the class never looks for a four-digit number.
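
A quick way to see this in Ruby is to apply the class to a couple of made-up strings (the sample strings and variable names below are assumptions for illustration only):

```ruby
# [^\d{4}] matches one character that is not a digit, '{', '4', or '}'.
negated = /[^\d{4}]*/

# The match stops at the first digit, even a lone one.
first = "I have 3 apples"[negated]
puts first.inspect    # prints "I have "

# It also stops at a brace, because '{' is a member of the class.
second = "curly {brace}"[negated]
puts second.inspect   # prints "curly "
```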

Matching until a four-digit word could also be done like this:

 (                             # (1 start)
      (                             # (2 start)
           \d{4} - \d{2} - \d{2} 
           \s+ 
           \d{2} \: \d{2} \: \d{2} 
           \s+ 
           [AP] [M] 
      )                             # (2 end)
      \: \s 
      ( .*? )                       # (3)
      \: \s 
      (                             # (4 start)
           (?:
                (?! \d{4} )              # Not 4 digits ahead of this character
                .                        # Ok, match the character
           )*
      )                             # (4 end)
 )                             # (1 end)
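
In Ruby, an expanded pattern like the one above can be written with the /x flag, which ignores whitespace and allows comments. The sketch below is a condensed variant that drops the outer whole-match group; the variable names are illustrative, and it assumes message text never contains four consecutive digits.

```ruby
string = '2014-03-29 10:29:24 AM: John Doe: Hey dude how are you feeling 2014-03-29 10:30:39 AM: Billy: Hey Doe, Im feeling better now. 2014-03-29 10:30:58 AM: Billy: Yup'

pattern = /
  ( \d{4}-\d{2}-\d{2} \s+ \d{2}:\d{2}:\d{2} \s+ [AP]M )   # (1) timestamp
  : \s
  ( .*? )                                                   # (2) sender
  : \s
  ( (?: (?! \d{4} ) . )* )                                  # (3) text, up to the next 4 digits
/x

results = string.scan(pattern)
results.each { |timestamp, sender, text| puts "#{timestamp} | #{sender} | #{text.strip}" }
```

The lookahead is checked before every single character, so this form is simple to read but does more work per character than the alternation shown earlier.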
Licensed under: CC-BY-SA with attribution