Question

I'm trying to split a big file using a regex. The problem is that I want to keep the delimiter in the text after the split. I tried adding ?= at the beginning of the regex, but then it doesn't split at all. I tested the modified regex in Sublime, and it works there.

Text is like this:

Aug 07, 2014 01:01:01 PM
some text
Aug 07, 2014 02:02:02 PM


So: a date, then some text, then another date. I want to split the text with a regex that recognizes that date.

First version of the regex, which works perfectly for my purpose:

\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].

Code in Python is this:

allparts = re.compile(r'\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].').split(alltext)

After adding ?=, it looks like this:

allparts2 =re.compile(r'(?=\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)').split(alltext)

What am I doing wrong in the second version?

Solution

Sorry, my first answer was wrong :) Don't add ?=; just wrap the pattern in parentheses, like this:

allparts2 =re.compile(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)').split(alltext)

Or try it without compile:

allparts2 = re.split(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM " 

allparts2 = re.split(r'(?=\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)
print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM ']

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM "


allparts2 = re.split(r'(?:\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['', ' some text ', ' another text ', ' ']

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM "


allparts2 = re.split(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['', 'Aug 07, 2014 01:01:01 PM', ' some text ', 'Aug 07, 2014 02:02:02 PM', ' another text ', 'Aug 07, 2014 03:03:03 AM', ' ']

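This is just to compare the different forms: in these runs the lookahead (?=...) does not split at all, the non-capturing group (?:...) splits but drops the dates, and the capturing group (...) splits and keeps each date as a separate list item.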
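If the goal is chunks that each begin with their own date, one option that does not depend on how re.split() treats zero-length matches is to find the match positions and slice the text manually. The following is only a minimal sketch: the names DATE_RE and split_keep_delimiter are hypothetical, and the pattern uses (?:AM|PM) in place of the original [AM|PM]. on the assumption that only AM/PM markers should match.

```python
import re

# Assumed pattern: (?:AM|PM) is an alternation, replacing the character class [AM|PM].
DATE_RE = re.compile(r'\w{3}\s\d{2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}\s(?:AM|PM)')


def split_keep_delimiter(text):
    """Return chunks of text, each starting at a date match (hypothetical helper)."""
    starts = [m.start() for m in DATE_RE.finditer(text)]
    if not starts:
        return [text]
    chunks = []
    # Keep any text that precedes the first date as its own chunk.
    if starts[0] > 0:
        chunks.append(text[:starts[0]])
    # Slice from each match start up to the next match start (or the end of the text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end])
    return chunks


alltext = ("Aug 07, 2014 01:01:01 PM some text "
           "Aug 07, 2014 02:02:02 PM another text "
           "Aug 07, 2014 03:03:03 AM ")
chunks = split_keep_delimiter(alltext)
print(chunks)
```

Joining the chunks back together reproduces the original text, so nothing is lost in the split.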

OTHER TIPS

Although I am unfamiliar with the Python flavour, Pythex gives me the following results, which I assume are correct:

See the result

Even if they are not, several things in your regex are unnecessary and/or incorrect as far as I know:

  • A comma does not need to be escaped (nor does a colon).
  • Alternation is not written as [cond1|cond2] but with parentheses: (cond1|cond2). Square brackets define a character class, so [AM|PM] matches a single character: A, M, | or P.
  • \s matches any whitespace character (a space, a tab, a carriage return, ...). That is correct if you want to allow all of those, but if only a space is expected, a literal space is enough.

Lastly, the ?= you are adding creates a lookahead, which matches without consuming any text; ?: makes the group match normally but keeps it from being a capture group.

Try this regex: (?:\w{3} \d{2}, \d{4} [\d:]+ (?:AM|PM))
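The difference between the character class and the alternation can be checked directly. This is a small illustrative sketch; the sample strings and the variable names are made up for the demonstration.

```python
import re

# [AM|PM] is a character class: it matches ONE character out of A, M, | or P.
char_class = re.compile(r'\s[AM|PM].')
# (?:AM|PM) is a non-capturing alternation: it matches the literal text AM or PM.
alternation = re.compile(r'\s(?:AM|PM)')

sample = "01:01:01 PM some text"
class_match = char_class.search(sample).group()
alt_match = alternation.search(sample).group()
print(repr(class_match))  # ' PM'
print(repr(alt_match))    # ' PM'

# The character class also accepts text that is not an AM/PM marker at all.
bogus = "01:01:01 |x"
print(char_class.search(bogus) is not None)  # True
print(alternation.search(bogus) is None)     # True
```

Both patterns agree on well-formed input, which is why the original regex appeared to work; they only differ on input the character class should not have accepted.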

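It seems that Python's re.split() doesn't split on zero-length matches, which is why the lookahead version returns the text unsplit. (That holds for Python 2.7 and for Python 3 before 3.7; from 3.7 on, re.split() also splits on empty matches.)

However, the manual says: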

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

...

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string.

So you can use:

allparts2 = re.compile(r'(\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s(?:AM|PM))').split(alltext)

Here the whole matching expression is surrounded by a capturing group (also notice the non-capturing group at the end). The result is:

['', 'Aug 07, 2014 01:01:01 PM', ' some text ', 'Aug 07, 2014 02:02:02 PM', ' another text ', 'Aug 07, 2014 03:03:03 AM', ' ']

You can then create your files by joining each date with the text that follows it: allparts2[1] with allparts2[2], allparts2[3] with allparts2[4], and so on (indices 2n+1 and 2n+2).
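The pairing step could look roughly like the sketch below. The names parts and records are illustrative, and the sample text is the one used in the earlier answer; writing each record to a file is left out.

```python
import re

# One capturing group around the whole date, so re.split() keeps the dates.
pattern = re.compile(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}\s(?:AM|PM))')

alltext = ("Aug 07, 2014 01:01:01 PM some text "
           "Aug 07, 2014 02:02:02 PM another text "
           "Aug 07, 2014 03:03:03 AM ")

parts = pattern.split(alltext)
# parts[0] is whatever precedes the first date (an empty string here);
# dates sit at odd indices and the text after each date at even indices.
records = [date + body for date, body in zip(parts[1::2], parts[2::2])]

for record in records:
    print(repr(record))
```

With a single capturing group the split list always has an odd length, so the odd-index and even-index slices pair up one to one.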

Licensed under: CC-BY-SA with attribution