Question

I have the below text:

BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING VERIFICATION LETTER
Document Handle: 712826
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 261711
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558564.PDF
BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING CODES COMPLIANCE LETTER
Document Handle: 712825
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 19441
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558563.pdf

I need to use a regex (which will go in a C# program) to convert this into something suitable for a CSV. The most vital data is the document handle and the filename (path) from each section (a section being everything under a "BEGIN:"). I'm working on this for someone else, so I'd like to retain as much as possible in case they decide they need some of the other data. This was my initial attempt:

\r\n(?!BEGIN).*\:

However, not every section has an "Ad Hoc:" component, which throws off the cell alignment when pulled into Excel. I know for sure that Ad Hoc is not part of the data needed for the end result.

The best-case scenario would be to select and remove everything between every "Ad Hoc" and "Handle:" and replace it with the delimiter (;). I would then combine this with my regex above.

My only other requirement is that this all has to be in one regex statement; otherwise, in the program I've written, I'll have to set up some sort of loop, which I'm not prepared to do yet.

Solution

Based on what I understood from the comments underneath the question, the example data given in the question should be transformed into two text lines like this:

Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf

To achieve this result while avoiding a loop (although I wonder why you would want to avoid loops; they are basic and omnipresent constructs), I would suggest applying two (or three, see section 3 below) regex substitutions.


1. Removal of "Label:" and replacement of line breaks with ";"

The first regular expression matches each label (everything up to and including the ":") together with any preceding line break, so that the whole match can be replaced with a semicolon. "Ad Hoc:" lines are swallowed completely. However, it will neither remove nor replace a line break in front of "BEGIN:", nor will it touch the "BEGIN:" itself.

@"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*"

This regex is an OR-combination of two regexes:

[\r\n]+\s*Ad\sHoc:.*?[\r\n]+.*?:\s*

which will match "Ad Hoc:" lines including any "Label:" string in the following line, and

([\r\n]+(?!\s*BEGIN)).*?:\s*

which will match any "Label:" including the line break in front of it, except for the "BEGIN:" label.

Applying this regex to your example and replacing all matches with ";" will result in the following:

BEGIN:;Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
BEGIN:;Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf

Note the "BEGIN:;" which we will take care of now.
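
The first substitution can be tried in isolation. The following is only a minimal sketch: the class name Step1Demo, the helper ToSemicolons and the shortened sample section are made up for illustration; only the pattern itself is taken from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class Step1Demo
{
    // Pattern from step 1: swallows "Ad Hoc:" lines and matches every other
    // "Label:" together with the line break in front of it.
    static readonly Regex ReSemicolons = new Regex(
        @"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*");

    // Hypothetical helper: replaces every match with the delimiter.
    public static string ToSemicolons(string source) =>
        ReSemicolons.Replace(source, ";");

    static void Main()
    {
        // Shortened section in the same layout as the question's data.
        string sample =
            "BEGIN:\r\n" +
            "Loan Number: 355211\r\n" +
            "Ad Hoc: ZONING VERIFICATION LETTER\r\n" +
            "Document Handle: 712826\r\n" +
            ">>FileName: \\V367\\2855\\1558564.PDF";

        Console.WriteLine(ToSemicolons(sample));
        // BEGIN:;355211;712826;\V367\2855\1558564.PDF
    }
}
```

A section without an "Ad Hoc:" line yields the same number of fields, which is what keeps the columns aligned in Excel.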


2. Elimination of the "BEGIN:" labels

This is a rather simple pattern when looking at the result of the first regex substitution.

"(?m)^BEGIN:;"

You might think that you can do this through a plain string replacement, and so did I when writing the first version of my answer. However, a mere string replacement becomes a problem if "BEGIN:;" can be part of the content of any other text field. It is better to be correct and safe by specifying a regex which matches only at the beginning of a line.
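
To illustrate the difference, here is a small sketch. The class BeginDemo, the helper StripBegin and the sample line (whose last field happens to contain "BEGIN:;") are hypothetical and not taken from the question's data.

```csharp
using System;
using System.Text.RegularExpressions;

static class BeginDemo
{
    // Anchored variant: (?m) makes ^ match at the start of every line.
    public static string StripBegin(string text) =>
        Regex.Replace(text, "(?m)^BEGIN:;", string.Empty);

    static void Main()
    {
        // Made-up line whose last field also contains "BEGIN:;".
        string line = "BEGIN:;355211;note BEGIN:;x";

        Console.WriteLine(StripBegin(line));                       // 355211;note BEGIN:;x
        Console.WriteLine(line.Replace("BEGIN:;", string.Empty));  // 355211;note x
    }
}
```

The plain string replacement silently damages the field content, while the anchored regex only removes the marker at the start of the line.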


3. Code example, including elimination of empty lines in the source text

If the source text contains empty lines, or lines consisting only of whitespace, the regular expressions shown above might not work properly. The solution is to do another regex substitution beforehand, which reduces such lines to a single line break (if you are certain that your source data does not contain empty lines, you can omit this step).

A complete code example, which would produce the result as mentioned at the beginning of my answer, could look like this:

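using System.Text.RegularExpressions;

string sourceData = "... your text with the source data ...";

// Collapse each run of whitespace that ends in a line break (blank lines included) into one line break.
Regex reEmptyLines = new Regex(@"[\s\r\n]+[\r\n]", RegexOptions.Compiled);
// Drop "Ad Hoc:" lines; turn every other "Label:" and the line break before it into ";".
Regex reSemicolons = new Regex(@"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*", RegexOptions.Compiled);
// Strip the "BEGIN:;" left at the start of each line.
Regex reBegin = new Regex("(?m)^BEGIN:;", RegexOptions.Compiled);

string processed =
    reBegin.Replace(
        reSemicolons.Replace(
            reEmptyLines.Replace(sourceData, "\r\n"),
            ";"
        ),
        string.Empty
    );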
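
Since the question names the document handle and the file name as the most vital values, it may help to see how they could be picked out of one processed line. This is only a sketch: FieldDemo, Pick and the two index constants are hypothetical, and the positions assume the 18-field layout of the sample output above.

```csharp
using System;

static class FieldDemo
{
    // Assumed positions in the semicolon-separated sample line
    // (0-based, with the "Ad Hoc" value already removed).
    const int HandleIndex = 3;
    const int FileNameIndex = 17;

    // Hypothetical helper: returns the two fields of interest.
    public static (string Handle, string FileName) Pick(string csvLine)
    {
        string[] fields = csvLine.Split(';');
        return (fields[HandleIndex], fields[FileNameIndex]);
    }

    static void Main()
    {
        string line = @"Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF";
        var (handle, fileName) = Pick(line);
        Console.WriteLine($"{handle};{fileName}");  // 712826;\V367\2855\1558564.PDF
    }
}
```

If other sections carry more or fewer labels, the indexes would have to be adjusted accordingly.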

Other tips

You can use a regex, but I wouldn't say it is easier than doing it manually in a loop.

(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)

Sample code:

using System;
using System.Linq;
using System.Text.RegularExpressions;

foreach (Match m in Regex.Matches(text, @"(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)"))
{
    Console.WriteLine(string.Join(",", m.Groups["value"].Captures.Cast<Capture>().Select(c => c.Value)));
}

Output:

Zoning Letter,4/16/2014,355211,712826,102,367,0,261711,0,1,0,0,16,0,2,0,0,\V367\2855\1558564.PDF
Zoning Letter,4/16/2014,355211,712825,102,367,0,19441,0,1,0,0,16,0,2,0,0,\V367\2855\1558563.pdf

How's this:

BEGIN:((?:(?!BEGIN:).)*)

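With RegexOptions.Singleline set (so that "." also matches line breaks), this would match everything from one BEGIN: up to the next, with the section's content captured in group 1. Without that option, "." stops at the first line break and the match ends there.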
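
A minimal sketch of how this pattern could be used, assuming RegexOptions.Singleline is set so that "." can cross line breaks. SectionDemo, CountSections and the shortened two-section sample are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SectionDemo
{
    // Singleline lets "." match line breaks, so group 1 spans a whole section.
    static readonly Regex ReSection = new Regex(
        @"BEGIN:((?:(?!BEGIN:).)*)", RegexOptions.Singleline);

    // Hypothetical helper: number of "BEGIN:" sections found.
    public static int CountSections(string text) => ReSection.Matches(text).Count;

    static void Main()
    {
        // Shortened sample with two sections.
        string sample =
            "BEGIN:\r\nDocument Handle: 712826\r\n" +
            "BEGIN:\r\nDocument Handle: 712825\r\n";

        foreach (Match m in ReSection.Matches(sample))
            Console.WriteLine(m.Groups[1].Value.Trim());
        // Document Handle: 712826
        // Document Handle: 712825
    }
}
```

Each section's text still has to be split into its labels afterwards, so this approach does need a loop over the matches.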

License: CC-BY-SA with attribution
Not affiliated with StackOverflow