Question

I have to work through a large file (several MB) and remove comments from it that are marked by a time. An example :

blablabla  12:10:40 I want to remove this
blablabla some more
even more bla

After filtering, I would like it to look like this :

blablabla
blablabla some more
even more bla

The nicest way to do it should be using a Regex:

Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);

Now this works perfectly for my purposes, but it's a bit slow. I'm assuming this is because the first two character classes, [012] and [0123456789], match a lot of the data (it's an ASCII file containing hexadecimal data, like "0045ab0123"). So the Regex gets a partial match on the first two characters far too often, only to fail at the ':'.

When I change the Regex to

Dataout = Regex.Replace(Datain, ":[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);

It gets an enormous speedup, probably because there are not many ':' characters in the file at all. Good! But I still need to check that the two characters before the first ':' are digits, and then delete the rest of the line.

So my question boils down to:

  • how can I make the Regex first search for the least frequent character, ':', and only after finding a match check the two characters before it?

Or maybe there's even a better way?


Solution

You could use both of the regular expressions in the question. First match with the leading-colon expression to quickly find or exclude candidate lines. Only if that succeeds, run the full replace expression.

MatchCollection mc = Regex.Matches(Datain, ":[012345][0123456789]:[012345][0123456789].*");

if ( mc.Count > 0 )
{
    Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);
}
else
{
    Dataout = Datain;
}

A variation might be

Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");
Regex changer = new Regex("[012][0123456789]:[012345][0123456789]:[012345][0123456789].*");

if ( finder.Match(Datain).Success)
{
    Dataout = changer.Replace(Datain, string.Empty);
}
else
{
    Dataout = Datain;
}

Another variation would be to use the finder as above. If the string is found then just check whether the previous two characters are digits.

Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");

Match m = finder.Match(Datain);
if ( m.Success && m.Index > 1)
{
    if ( char.IsDigit(Datain[m.Index-1]) && char.IsDigit(Datain[m.Index-2]) )
    {
        Dataout = m.Index-2 == 0 ? string.Empty : Datain.Substring(0, m.Index-2);
    }
    else
    {
        Dataout = Datain;
    }
}
else
{
    Dataout = Datain;
}

In the second and third ideas, the finder and changer should be declared and given values before reading any lines. There is no need to execute new Regex(...) inside the line-reading loop.
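As a rough sketch of that last point, the two expressions can be built once as static fields and reused for every line. The class name TimeCommentFilter and the helpers CleanLine and CleanLines are hypothetical names chosen for illustration, and the patterns use the shorter [0-9] ranges instead of the spelled-out digit lists:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TimeCommentFilter
{
    // Both expressions are constructed once and reused for every line.
    private static readonly Regex Finder =
        new Regex(":[0-5][0-9]:[0-5][0-9]", RegexOptions.Compiled);
    private static readonly Regex Changer =
        new Regex("[012][0-9]:[0-5][0-9]:[0-5][0-9].*", RegexOptions.Compiled);

    // Hypothetical helper: removes the time and everything after it from one line.
    public static string CleanLine(string line)
    {
        // Cheap pre-check first: lines without ":mm:ss" never reach the full expression.
        return Finder.IsMatch(line) ? Changer.Replace(line, string.Empty) : line;
    }

    // Hypothetical helper: applies CleanLine lazily to a sequence of lines.
    public static IEnumerable<string> CleanLines(IEnumerable<string> lines)
    {
        foreach (string line in lines)
            yield return CleanLine(line);
    }
}
```

CleanLines could be fed from File.ReadLines(path), so the file is streamed line by line rather than loaded as one large string.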

Other tips

You could use DateTime.TryParseExact to check whether a word is a time, and keep only the words before it. Here's a LINQ query that cleans all lines of the file at path; it may be more efficient:

string format = "HH:mm:ss";
DateTime time;
var cleanedLines = File.ReadLines(path)
    .Select(l => string.Join(" ", l.Split().TakeWhile(w => w.Length != format.Length
       ||  !DateTime.TryParseExact(w, format, CultureInfo.InvariantCulture, DateTimeStyles.None, out time))));

If performance is very critical you could also create a specialized method that is optimized for this task. Here is one approach that should be much more efficient:

public static string SubstringBeforeTime(string input, string timeFormat = "HH:mm:ss")
{
    if (string.IsNullOrWhiteSpace(input))
        return input;
    DateTime time;

    if (input.Length == timeFormat.Length && DateTime.TryParseExact(input, timeFormat, CultureInfo.InvariantCulture, DateTimeStyles.None, out time))
    {
        return ""; // full text is time
    }
    char[] wordSeparator = {' ', '\t'};
    int lastIndex = 0;
    int spaceIndex = input.IndexOfAny(wordSeparator);
    if(spaceIndex == -1)
        return input;
    char[] chars = input.ToCharArray();
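    while(spaceIndex >= 0)
    {
        // Check the word that ends at spaceIndex before looking any further.
        string nextWord = input.Substring(lastIndex, spaceIndex - lastIndex);
        if( nextWord.Length == timeFormat.Length 
         && DateTime.TryParseExact(nextWord, timeFormat, CultureInfo.InvariantCulture, DateTimeStyles.None, out time))
        {
            return input.Substring(0, lastIndex);
        }
        int nonSpaceIndex = Array.FindIndex<char>(chars, spaceIndex + 1, x => !wordSeparator.Contains(x));
        if(nonSpaceIndex == -1)
            return input; // only separators remain
        lastIndex = nonSpaceIndex;
        spaceIndex = input.IndexOfAny(wordSeparator, nonSpaceIndex + 1);
    }
    // The last word has no separator after it, so check it here.
    string lastWord = input.Substring(lastIndex);
    if( lastWord.Length == timeFormat.Length 
     && DateTime.TryParseExact(lastWord, timeFormat, CultureInfo.InvariantCulture, DateTimeStyles.None, out time))
    {
        return input.Substring(0, lastIndex);
    }
    return input;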
}

Sample data and test:

string[] lines = { "blablabla  12:10:40 I want to remove this", "blablabla some more", "even more bla  ", "14:22:11" };
foreach(string line in lines)
{
    string newLine = SubstringBeforeTime(line, "HH:mm:ss");
    Console.WriteLine(string.IsNullOrEmpty(newLine) ? "<empty>" : newLine);
}

Output:

blablabla  
blablabla some more
even more bla  
<empty>

In the end I went for this:

        // Both expressions are built once, outside the loop.
        Regex rgx = new Regex(":[0-5][0-9]:[0-5][0-9]", RegexOptions.Multiline | RegexOptions.Compiled);
        Regex rgx2 = new Regex("[012][0-9]:[0-5][0-9]:[0-5][0-9].*", RegexOptions.Multiline);
        bool MeerCCOl = true;
        int startpositie = 0;
        int CCOLfound = 0; // number of times terminal output was found

        while(MeerCCOl)
        {
            Match GevondenColon = rgx.Match(VlogDataGefilterd,startpositie);

            MeerCCOl = GevondenColon.Success; // CCOL terminal data found, there may be more..

            if (MeerCCOl)
            {
                // Resume just past this colon by default, so a rejected match is not found again (avoids an endless loop).
                startpositie = GevondenColon.Index + 1;

                if (GevondenColon.Index >= 2)
                {
                    char h1 = VlogDataGefilterd[GevondenColon.Index - 2]; // tens digit of the hour
                    char h2 = VlogDataGefilterd[GevondenColon.Index - 1]; // units digit of the hour
                    int GevondenUur = 10 * (h1 - '0') + (h2 - '0');
                    if (h1 >= '0' && h1 <= '2' && h2 >= '0' && h2 <= '9' &&
                        GevondenUur>=0 && GevondenUur<=23)
                    {
                        CCOLfound++; // count only matches confirmed as a time
                        VlogDataGefilterd = rgx2.Replace(VlogDataGefilterd, string.Empty, 1, (GevondenColon.Index - 2));
                        startpositie = GevondenColon.Index - 2; // start the next match from the place where we 
                    }
                }
            }
        }

It first searches for a match to :xx:xx and then checks the 2 characters before that. If the whole thing is recognized as a time, it removes it together with the rest of the line. A bonus is that by checking the hours separately, I can make sure the hours read 00-23 instead of 00-29. The number of matches is also counted this way.
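For comparison, here is a sketch of the same colon-first idea that copies the kept text into a StringBuilder in a single pass, instead of rebuilding the whole string on every replacement. It is an illustration under assumptions, not the code that was measured: the names ColonFirstFilter, Strip and IsHour are made up for this example, and lines are assumed to end in '\n'.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ColonFirstFilter
{
    // Matches only the rare ":mm:ss" part; the hour is checked by hand afterwards.
    private static readonly Regex ColonPart =
        new Regex(":[0-5][0-9]:[0-5][0-9]", RegexOptions.Compiled);

    // Removes every "HH:mm:ss ... end of line" span; count receives the number of removals.
    public static string Strip(string text, out int count)
    {
        var result = new StringBuilder(text.Length);
        int copiedUpTo = 0; // everything before this index is already handled
        count = 0;
        Match m = ColonPart.Match(text);
        while (m.Success)
        {
            int start = m.Index - 2; // where the two hour digits would begin
            if (start >= copiedUpTo && IsHour(text[start], text[start + 1]))
            {
                // Keep the text before the time, then skip to the end of the line.
                result.Append(text, copiedUpTo, start - copiedUpTo);
                int lineEnd = text.IndexOf('\n', m.Index);
                copiedUpTo = lineEnd < 0 ? text.Length : lineEnd;
                count++;
                m = ColonPart.Match(text, copiedUpTo);
            }
            else
            {
                // Not a time: continue one character further, so overlapping candidates are still seen.
                m = ColonPart.Match(text, m.Index + 1);
            }
        }
        result.Append(text, copiedUpTo, text.Length - copiedUpTo);
        return result.ToString();
    }

    // True when the two characters form an hour from 00 to 23.
    private static bool IsHour(char tens, char ones)
    {
        if (tens < '0' || tens > '2' || ones < '0' || ones > '9') return false;
        return 10 * (tens - '0') + (ones - '0') <= 23;
    }
}
```

Whether this is faster than the loop above would have to be measured on the real data file.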

The original simple regex took about 550ms. This code (while messier) takes about 12ms for the same data file. That's a speedup of more than 40x :-)

Thanks all!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow