Question

I am writing a regex to extract various pieces of information from an EDIFACT UN code list. There are tens of thousands of codes, so rather than typing them in by hand I want to parse the text file and pull out the parts I need. The file is structured so that those parts are easy to identify.

I built the following regex and tested it in Regex Hero, but I cannot get the codeComment group to match everything up to a double line break. I tried the character class [^\n\n], but the group still stops at the first line break.

Note: I have selected the Multiline option on Regex Hero.

(?<element>\d+)\s\s(?<elementName>.*)\[[B|C|I]\]\s+Desc: (?<desc>[^\n]*\s*[^\n]*)
^\s*Repr: (?<type>a(?:n)?)..(?<length>\d+)
^\s*(?<code>\d+)\s*(?<codeName>[^\n]*)
^\s{14}(?<codeComment>[^\n]*)

This is the example text I am using to match.

----------------------------------------------------------------------

  • 1073 Document line action code [B]

    Desc: Code indicating an action associated with a line of a
        document.

    Repr: an..3

    1 Included in document/transaction
        The document line is included in the
        document/transaction.
        should capture this as well.

    2 Excluded from document/transaction
        The document line is excluded from the
        document/transaction.

What I want is for codeComment to contain the following:

The document line is included in the
          document/transaction.
          should capture this as well.

but it is only extracting the first line:

The document line is included in the

Solution

In a character class, every character counts once, no matter how often you write it. So a character class can't be used to check for consecutive linebreaks. But you can use a lookahead assertion:

^\s{14}(?<codeComment>(?s)(?:(?!\n\n).)*)

(?s) switches on singleline mode (to allow the dot to match newlines).

(?!\n\n) asserts that there are no two consecutive linebreaks at the current position.
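
For anyone not on .NET, here is a minimal sketch of the same lookahead idea in Python. Some assumptions: the sample text and its indentation are modeled on the excerpt in the question, not taken from the real file, and the names SAMPLE, CODE_ENTRY and parse_codes are made up for illustration. Python writes named groups as (?P<name>...) and does not accept a global (?s) in the middle of a pattern, so the sketch uses the scoped form (?s:...) instead.

```python
import re

# Sample modeled on the question's excerpt; the exact indentation is an assumption.
SAMPLE = (
    "    1 Included in document/transaction\n"
    "        The document line is included in the\n"
    "        document/transaction.\n"
    "        should capture this as well.\n"
    "\n"
    "    2 Excluded from document/transaction\n"
    "        The document line is excluded from the\n"
    "        document/transaction.\n"
)

# (?s:...) lets the dot match newlines inside the group only.
# (?!\n\n) stops the repetition just before two consecutive line breaks.
CODE_ENTRY = re.compile(
    r"^[ ]*(?P<code>\d+)[ ]+(?P<codeName>[^\n]*)\n"
    r"[ ]+(?P<codeComment>(?s:(?:(?!\n\n).)*))",
    re.MULTILINE,
)


def parse_codes(text):
    """Return (code, name, comment) tuples, with comment whitespace collapsed."""
    return [
        (m["code"], m["codeName"], " ".join(m["codeComment"].split()))
        for m in CODE_ENTRY.finditer(text)
    ]


if __name__ == "__main__":
    for entry in parse_codes(SAMPLE):
        print(entry)
```

The captured comment still contains the line breaks and indentation of the continuation lines, which is why the sketch collapses whitespace afterwards. If the file uses \r\n line endings, the lookahead would need to be something like (?!\r?\n\r?\n) instead.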

OTHER TIPS

Try

    [\r\n]{2,}

to match double line breaks.

I used this in DWR to remove the doubled, bloated line breaks that were left over after unzipping files.

More info: How to remove unwanted "extra line breaks" that appear in PHP/CSS/JS files after unzip?
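
A rough Python sketch of how this pattern could be used to collapse runs of line breaks. The names BLANK_RUN and collapse_blank_lines are hypothetical. One caveat worth knowing: on text with Windows line endings a single \r\n is already two characters from the class, so the pattern matches single line breaks there as well.

```python
import re

# Two or more CR/LF characters in a row.
BLANK_RUN = re.compile(r"[\r\n]{2,}")


def collapse_blank_lines(text, newline="\n"):
    """Replace every run of two or more CR/LF characters with one newline."""
    return BLANK_RUN.sub(newline, text)


if __name__ == "__main__":
    print(repr(collapse_blank_lines("a\n\n\nb\nc")))
    # A single CRLF also counts as a run of two characters:
    print(BLANK_RUN.search("a\r\nb") is not None)
```

That behaviour is fine for cleaning up bloated files, but it means the pattern cannot tell a single \r\n from a blank line, so it is less suited to finding the end of a paragraph.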

This one is simple and works best for me:

/[\r]?\n[\r]?\n/g
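
The pattern above is written as a JavaScript literal. As a sketch, the same expression in Python could be used to split text into blocks at blank lines (DOUBLE_BREAK and split_blocks are names invented for this example):

```python
import re

# Exactly two line breaks in a row, each either \n or \r\n.
DOUBLE_BREAK = re.compile(r"\r?\n\r?\n")


def split_blocks(text):
    """Split text at each double line break, leaving single breaks alone."""
    return DOUBLE_BREAK.split(text)


if __name__ == "__main__":
    print(split_blocks("a\r\n\r\nb\n\nc\nd"))
```

Because each line break is matched as a unit, a single \r\n is left alone. Runs of three or more line breaks are only consumed two at a time, so a leftover break can end up at the start of the next block.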
Licensed under: CC-BY-SA with attribution