Matching double line breaks using Regex
Question
I am writing a Regex that will extract the various pieces of information from an EDIFACT UN Codes List. As there are tens of thousands of codes, I do not want to type them all in, so I have decided to use Regex to parse the text file and extract the bits that I need. The text file is structured in a way that lets me easily identify the bits that I want.
I have created the following Regex using Regex Hero to test it, but I just cannot get it to match everything up to a double line break for the codeComment group. I have tried using the character class [^\n\n] but this still won't match double line breaks.
Note: I have selected the Multiline option on Regex Hero.
(?<element>\d+)\s\s(?<elementName>.*)\[[B|C|I]\]\s+Desc: (?<desc>[^\n]*\s*[^\n]*)
^\s*Repr: (?<type>a(?:n)?)..(?<length>\d+)
^\s*(?<code>\d+)\s*(?<codeName>[^\n]*)
^\s{14}(?<codeComment>[^\n]*)
This is the example text I am using to match.
----------------------------------------------------------------------
1073 Document line action code [B]
Desc: Code indicating an action associated with a line of a
document.

Repr: an..3
1 Included in document/transaction
The document line is included in the
document/transaction.
should capture this as well.

2 Excluded from document/transaction
The document line is excluded from the
document/transaction.
What I want is for codeComment to contain the following:
The document line is included in the
document/transaction.
should capture this as well.
but it is only extracting the first line:
The document line is included in the
Solution
In a character class, every character counts once, no matter how often you write it. So a character class can't be used to check for consecutive linebreaks. But you can use a lookahead assertion:
^\s{14}(?<codeComment>(?s)(?:(?!\n\n).)*)
(?s) switches on singleline mode (to allow the dot to match newlines).
(?!\n\n) asserts that there are no two consecutive linebreaks at the current position.
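To see both points in action, here is a minimal sketch ported to Python rather than the .NET flavor used in the question. Two things differ and are my adaptation, not part of the answer above: Python spells named groups (?P<name>...), and it wants the singleline flag scoped as (?s:...) instead of a bare (?s) in the middle of the pattern. The sample text and its indentation are reconstructed from the question, so treat the exact whitespace as an assumption.

```python
import re

# Sample reconstructed from the question. The 14-space indent on the
# comment lines is an assumption inferred from \s{14} in the pattern.
SAMPLE = (
    "     1     Included in document/transaction\n"
    "              The document line is included in the\n"
    "              document/transaction.\n"
    "              should capture this as well.\n"
    "\n"
    "     2     Excluded from document/transaction\n"
    "              The document line is excluded from the\n"
    "              document/transaction.\n"
)

# A character class is a set, so [^\n\n] behaves exactly like [^\n].
same_matches = (
    re.findall(r"[^\n\n]+", "a\n\nb") == re.findall(r"[^\n]+", "a\n\nb")
)

# Lookahead version: consume any character (including newlines) as long
# as the current position is not the start of two consecutive newlines.
CODE_COMMENT = re.compile(
    r"^\s{14}(?P<codeComment>(?s:(?:(?!\n\n).)*))",
    re.MULTILINE,
)

# Split each captured comment into lines and trim the indentation.
comments = [
    [line.strip() for line in m.group("codeComment").splitlines()]
    for m in CODE_COMMENT.finditer(SAMPLE)
]

for comment in comments:
    print(comment)
```

The capture keeps the indentation of the continuation lines, which is why the sketch strips each line afterwards. If the file uses Windows line endings, \n\n will never appear and the lookahead would probably need to be (?!\r?\n\r?\n) instead.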
OTHER TIPS
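Try
[\r\n]{2,}
to match double line breaks. Be aware that it also matches a single Windows-style \r\n pair, since that is already two characters from the class.
I used this in DWR to remove double/bloated line breaks (left over from unzipping files for some reason).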
more info: How to remove unwanted "extra line breaks" that appear in PHP/CSS/JS files after unzip?
This one is simple and works best for me:
/[\r]?\n[\r]?\n/g
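A quick way to compare the two patterns from these tips is to split some text on them. The sketch below uses Python; the sample strings and the names LOOSE and STRICT are made up for illustration. It suggests that the two patterns agree on \n-only text but differ on \r\n text, where the character-class version also fires on a single line break.

```python
import re

# Names and sample strings are hypothetical, chosen for illustration.
LOOSE = re.compile(r"[\r\n]{2,}")    # any run of two or more CR/LF characters
STRICT = re.compile(r"\r?\n\r?\n")   # two line breaks, each LF or CRLF

unix_text = "one\n\ntwo\nthree"
windows_text = "one\r\n\r\ntwo\r\nthree"

# On LF-only text both patterns split only at the blank line.
loose_unix = LOOSE.split(unix_text)
strict_unix = STRICT.split(unix_text)

# On CRLF text a single "\r\n" is already two characters from the class,
# so LOOSE also splits between "two" and "three".
loose_windows = LOOSE.split(windows_text)
strict_windows = STRICT.split(windows_text)

print(loose_unix, strict_unix)
print(loose_windows)
print(strict_windows)
```

So for files that may contain \r\n, the second pattern looks like the safer choice for detecting a genuinely blank line.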