Question

I'm looking for a little guidance with RegEx patterns.

I have a pipe-delimited file and I want to remove all lines where the fourth cell is blank. Each line can have any number of cells.

My code so far:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace EpicRemoveBlankPriceRecords
{
    class Program
    {
        static void Main(string[] args)
        {
            string line;

            // Read the file and display it line by line.
            System.IO.StreamReader inFile = new System.IO.StreamReader("c:\\test\\test.txt");
            System.IO.StreamWriter outFile = new System.IO.StreamWriter("c:\\test\\test_out.txt");
            while ((line = inFile.ReadLine()) != null)
            {
                Match myMatch = Regex.Match(line, @".*\|.*\|.*\|\|.*");
                if (!myMatch.Success)
                {
                    outFile.WriteLine(line);
                }
            }

            inFile.Close();
            outFile.Close();

            //// Suspend the screen.
            //Console.ReadLine();


        }
    }
}

This doesn't work. I THINK it's because the RegEx is "greedy" - this matches if there are any blank cells because I haven't explicitly said "catch everything EXCEPT a pipe character". A quick google and I see I can do that using [^\|] in the pattern.

So, if I change the pattern to:

 ".*[^\|]\|.*[^\|]\|.*[^\|]\|\|.*"

Why doesn't this work either?

Guess I'm a little confused, any pointers would be much appreciated.

Thanks!


Solution 2

This appears to work on regexpal:

^[^|]*\|[^|]*\|[^|]*\|\|.*
  • ^ alone means start of line
  • [^|] any character except |
  • [^|]* match zero or more non | characters
  • + (not used in the pattern above, and probably wrong for your usage) means at least one and however many more it finds, so [^|]+ would not match an empty cell
  • .* means anything at all and as many of them as it can find.

test data:

  • abc|123|234||673
  • abc|def||123|456
  • abc|123|234|673||ab
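
For anyone who wants to check this in C# rather than regexpal, here is a minimal sketch that runs the pattern over the three test lines. The class and method names (FourthCellDemo, HasBlankFourthCell) are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class FourthCellDemo
{
    // Pattern from this answer: three cells of non-pipe characters,
    // each followed by a pipe, then a fourth pipe straight away.
    public static readonly Regex BlankFourthCell =
        new Regex(@"^[^|]*\|[^|]*\|[^|]*\|\|.*");

    // True when the fourth cell is empty and is followed by another pipe.
    public static bool HasBlankFourthCell(string line) =>
        BlankFourthCell.IsMatch(line);

    static void Main()
    {
        var testData = new[]
        {
            "abc|123|234||673",    // fourth cell blank
            "abc|def||123|456",    // third cell blank
            "abc|123|234|673||ab"  // fifth cell blank
        };

        foreach (var line in testData)
        {
            Console.WriteLine($"{line} -> {HasBlankFourthCell(line)}");
        }
    }
}
```

Only the first line should report True. The other two have a blank cell, but not in the fourth position.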

OTHER TIPS

Do you really need regex here?

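var lines = File.ReadLines(filename)
           .Where(line =>
           {
               var cells = line.Split('|');
               // Lines with fewer than four cells have no fourth cell, so keep them
               // (indexing cells[3] directly would throw on such lines).
               return cells.Length < 4 || !String.IsNullOrWhiteSpace(cells[3]);
           });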

File.WriteAllLines(outfile, lines);

.*[^\|] means zero or more wild-cards (.*) and one character that isn't a | ([^\|]).

Also, you don't need to escape | inside [].

And Regex.Match doesn't actually match, it searches, so you need ^ at the start of the regex (which indicates the start of string).

And the trailing .* is thus also not required.

You instead want zero or more characters that aren't |, like this:

@"^[^|]*\|[^|]*\|[^|]*\|\|"

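
To show how the anchored pattern could slot into the original program, here is a rough sketch. The names BlankPriceFilter, KeepLines and FilterFile are hypothetical, and it assumes the input and output paths are different files, since File.ReadLines reads lazily.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static class BlankPriceFilter
{
    // Anchored at the start; each cell is "zero or more non-pipe characters".
    static readonly Regex BlankFourthCell =
        new Regex(@"^[^|]*\|[^|]*\|[^|]*\|\|");

    // Keeps every line whose fourth cell is not blank.
    public static IEnumerable<string> KeepLines(IEnumerable<string> lines) =>
        lines.Where(line => !BlankFourthCell.IsMatch(line));

    // Streams the input file through the filter into the output file.
    public static void FilterFile(string inPath, string outPath) =>
        File.WriteAllLines(outPath, KeepLines(File.ReadLines(inPath)));

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: BlankPriceFilter <input> <output>");
            return;
        }
        FilterFile(args[0], args[1]);
    }
}
```

Keeping the filtering in KeepLines, separate from the file handling, makes it easy to try the pattern on a few in-memory lines before pointing it at a real file.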

Why ".*\|.*\|.*\|\|.*" didn't work:

Apart from the above reasons...

* being greedy doesn't really change much (you can make it non-greedy / lazy by doing .*?). The problem is that . also matches | and it backtracks, so .* will include as many or as few |'s as required for it to match the string (yes, it will try to include more because it's greedy, but this doesn't change whether or not it finds something, only what it finds).
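
A small sketch to make this concrete: on a line where only the fifth cell is blank, the greedy and lazy versions of the original pattern both still match, because one of the .* parts is free to swallow a pipe. The class name BacktrackingDemo is just for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class BacktrackingDemo
{
    public const string Greedy = @".*\|.*\|.*\|\|.*";
    public const string Lazy = @".*?\|.*?\|.*?\|\|.*";
    public const string Anchored = @"^[^|]*\|[^|]*\|[^|]*\|\|";

    static void Main()
    {
        // The fourth cell is "673"; it is the fifth cell that is blank.
        string line = "abc|123|234|673||ab";

        foreach (var pattern in new[] { Greedy, Lazy, Anchored })
        {
            Console.WriteLine($"{pattern} -> {Regex.IsMatch(line, pattern)}");
        }
    }
}
```

Greedy and Lazy both report True here (the third .* ends up covering "234|673"), while Anchored reports False, which is the behaviour the question is after.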

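You can hack something together using lazy matching and possessive quantifiers, but it will end up being somewhat more complex and, more importantly, I suppose, .NET doesn't have possessive quantifiers. It does support lazy quantifiers, and atomic groups, written (?>...), are the closest substitute.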
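
For the curious, here is roughly what such a hack might look like in .NET, using a lazy quantifier inside an atomic group. This is only a sketch; the negated character class version is simpler and is what I'd actually use. AtomicGroupDemo and HasBlankFourthCell are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

static class AtomicGroupDemo
{
    // (?>.*?\|) lazily consumes one cell plus its pipe; because the group is
    // atomic, the engine cannot come back later and stretch the cell across a pipe.
    static readonly Regex BlankFourthCell = new Regex(@"^(?>.*?\|){3}\|");

    public static bool HasBlankFourthCell(string line) =>
        BlankFourthCell.IsMatch(line);

    static void Main()
    {
        foreach (var line in new[] { "abc|123|234||673", "abc|def||123|456", "abc|123|234|673||ab" })
        {
            Console.WriteLine($"{line} -> {HasBlankFourthCell(line)}");
        }
    }
}
```

Only the first line should report True, the same result as with the [^|]* pattern.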

Licensed under: CC-BY-SA with attribution