Question

I'm trying to write a regular expression to pull blocks of text out of the history files I keep for the projects I'm building. For now I plan to do the extraction manually in my text editor (either TextMate or Sublime Text 2), but eventually I'll build it into a scripted process using either Python or PHP (I haven't decided yet).

All of the history entries in my history file have the format:

YYYY-MM-DD - Chris -- Version: X.X.X
====================================
- Lorem ipsum dolor sit amet, vim id libris epicuri
- Et eos veri quodsi appetere, an qui saepe malorum eloquentiam.
...

--

Where X is the version number that the work was done under.

I'm trying to pull everything from the version number to the final double dash delimiter which denotes the end of the block of text.

I started with a regular expression that selects the section heading, which works:

(^[\d]{4}-[\d]{2}-[\d]{2}\s-\s[\w]+\s--\sVersion:\s)[\d\.]+$

But when I try to turn the pattern inside my parentheses into a lookbehind, it fails:

(?<=^[\d]{4}-[\d]{2}-[\d]{2}\s-\s[\w]+\s--\sVersion:\s)[\d\.]+$ 

I've been looking around and so far it seems like this lookbehind format is correct. I can't seem to figure out what I'm missing. Any ideas?


Solution 2

Neither PHP nor Python allows arbitrary-length lookbehind. As soon as the lookbehind contains an unbounded quantifier such as +, the pattern is rejected.

So your first attempt is the only thing that will work here.
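
To make that concrete, here is a minimal Python sketch, assuming the standard-library re module. The sample entry (date and version number) is made up for illustration, and the helper names are hypothetical. It checks whether the lookbehind version compiles, then moves the parentheses so that the version number is captured instead of the heading prefix.

```python
import re

# Hypothetical sample entry in the format described in the question.
SAMPLE = """2013-05-01 - Chris -- Version: 1.2.3
====================================
- Lorem ipsum dolor sit amet

--
"""

# Heading prefix from the question, with the redundant [] removed.
HEADING_PREFIX = r"^\d{4}-\d{2}-\d{2}\s-\s\w+\s--\sVersion:\s"


def lookbehind_compiles():
    """Return True if re accepts the prefix inside a lookbehind."""
    try:
        re.compile(r"(?<=" + HEADING_PREFIX + r")[\d.]+$", re.MULTILINE)
    except re.error:
        # \w+ has no fixed width, so the lookbehind is rejected.
        return False
    return True


# Match the prefix normally and capture only the version number.
VERSION_RE = re.compile(HEADING_PREFIX + r"([\d.]+)$", re.MULTILINE)


def find_versions(text):
    """Return every version number found in a heading line."""
    return VERSION_RE.findall(text)


print(lookbehind_compiles())
print(find_versions(SAMPLE))
```

The capture group plays the role the lookbehind was meant to play: the whole heading is matched, but only the group is returned.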

OTHER TIPS

As Joey stated, there is no arbitrary-length lookbehind in PHP or Python. But PHP has a workaround: the \K escape sequence.

From the docs:

The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:

   foo\Kbar

matches "foobar", but reports that it has matched "bar". This feature is similar to a lookbehind assertion (described below). However, in this case, the part of the subject before the real match does not have to be of fixed length, as lookbehind assertions do.

After removing the redundant brackets [], your expression would look like:

(?m)^\d{4}-\d{2}-\d{2}\s-\s\w+\s--\sVersion:\s\K[\d.]+$
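
If the script ends up in Python rather than PHP, note that the standard re module does not support \K, so a capture group has to stand in for it. The sketch below is one possible way to do that and also to grab the whole entry up to the closing -- line, which is what the question is ultimately after. The sample history text, the group names version and body, and the function extract_blocks are all assumptions made for illustration.

```python
import re

# Hypothetical history file with two entries in the question's format.
HISTORY = """\
2013-05-01 - Chris -- Version: 1.2.3
====================================
- Lorem ipsum dolor sit amet
- Et eos veri quodsi appetere

--

2013-05-09 - Chris -- Version: 1.3.0
====================================
- An qui saepe malorum

--
"""

BLOCK_RE = re.compile(
    r"""
    ^\d{4}-\d{2}-\d{2}\s-\s\w+\s--\sVersion:\s   # heading prefix (the part \K would drop)
    (?P<version>[\d.]+)\n                        # version number
    =+\n                                         # underline row
    (?P<body>.*?)                                # entry lines, matched lazily
    ^--$                                         # closing delimiter on its own line
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)


def extract_blocks(text):
    """Return (version, body) pairs for every history entry in text."""
    return [(m["version"], m["body"].strip()) for m in BLOCK_RE.finditer(text)]


for version, body in extract_blocks(HISTORY):
    print(version)
    print(body)
```

The lazy .*? together with ^--$ stops each match at the first line that consists only of two dashes, so the -- inside a heading line or the single dash that starts a bullet does not end the block early.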


Notes:

  • (?m) is an inline modifier that turns on multiline mode, so ^ and $ match at the start and end of each line
  • You don't need to escape a dot . in a character class: [.] matches a literal dot, not any character
  • You may add quantifiers to the whitespace tokens: \s* or \s+
  • \w+ also matches the underscore _, so to exclude it you may use [^\W_]+
  • Regex is awesome
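
The notes above can be checked quickly in Python. This is only a small sketch using the standard re module, and the input strings are invented for the purpose.

```python
import re

# [.] inside a character class is a literal dot, not "any character".
dots = re.findall(r"[.]", "1.2a3")

# \w+ treats the underscore as a word character; [^\W_]+ does not.
with_underscore = re.findall(r"\w+", "chris_b")
without_underscore = re.findall(r"[^\W_]+", "chris_b")

# \s+ tolerates runs of whitespace where a single \s would fail.
loose = re.search(r"Version:\s+([\d.]+)", "Version:   1.0")
strict = re.search(r"Version:\s([\d.]+)", "Version:   1.0")

print(dots, with_underscore, without_underscore)
print(loose.group(1), strict)
```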
Licensed under: CC-BY-SA with attribution