Question

I need to create a script to search through just under a million files of text, code, etc. to find matches and then output all hits on a particular string pattern to a CSV file.

So far I have made this:

$location = 'C:\Work*'

$arr = "foo", "bar" # where "foo" and "bar" are string patterns I want to search for (separately)

for ($i = 0; $i -lt $arr.Length; $i++) {
    Get-ChildItem $location -Recurse |
        Select-String -Pattern $arr[$i] |
        Select-Object Path |
        Export-Csv "C:\Work\Results\$($arr[$i]).txt"
}

This gives me a CSV file named "foo.txt" listing every file that contains the word "foo", and a file named "bar.txt" listing every file that contains the word "bar".

Can anyone think of a way to optimize this script so it runs faster? Or ideas for an entirely different, but equivalent, script that is simply faster?

All input appreciated!

Solution

If your files are not huge and can be read into memory, then this version should work considerably faster (and my quick and dirty local test seems to confirm that). For files too large to read into memory, see the streaming sketch after the code.

$location = 'C:\ROM'
$arr = "Roman", "Kuzmin"

# remove output files
foreach($test in $arr) {
    Remove-Item ".\$test.txt" -ErrorAction 0
}

# pipe files through a bare script block with a process block
# (avoids per-item ForEach-Object overhead)
Get-ChildItem $location -Recurse | .{process{ if (!$_.PSIsContainer) {
    # read all text once
    $content = [System.IO.File]::ReadAllText($_.FullName)
    # test patterns and output paths once
    foreach($test in $arr) {
        if ($content -match $test) {
            $_.FullName >> ".\$test.txt"
        }
    }
}}}
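
If some of the files are too large to read into memory in one go, a minimal streaming sketch of the same idea is below (my variant, not part of the answer above; it assumes .NET 4+ for File.ReadLines, and patterns that span line breaks will not match):

# Sketch: stream each file line by line instead of ReadAllText;
# $location and $arr are the same variables as in the example above.
Get-ChildItem $location -Recurse | Where-Object { !$_.PSIsContainer } | ForEach-Object {
    $file = $_
    $remaining = @($arr)  # patterns not yet found in this file
    foreach ($line in [System.IO.File]::ReadLines($file.FullName)) {
        foreach ($test in @($remaining)) {
            if ($line -match $test) {
                $file.FullName >> ".\$test.txt"
                $remaining = @($remaining | Where-Object { $_ -ne $test })
            }
        }
        if ($remaining.Count -eq 0) { break }  # all patterns found; stop reading
    }
}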

Notes: 1) adjust the paths and patterns in the example to your own; 2) the output files are plain text, not CSV; there is not much point in CSV if you are only interested in paths, since a plain text file with one path per line will do (if you do need CSV, see the sketch below).
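
If you do want CSV output after all, here is a small post-processing sketch (assuming the path lists produced by the script above sit in the current directory; the .csv file names are my choice):

# Hypothetical post-processing: turn each plain-text path list into a
# one-column CSV (requires PowerShell 3+ for [pscustomobject]).
foreach($test in $arr) {
    Get-Content ".\$test.txt" |
        ForEach-Object { [pscustomobject]@{ Path = $_ } } |
        Export-Csv ".\$test.csv" -NoTypeInformation
}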

Other tips

Let's suppose that 1) the files are not too big and you can load them into memory, and 2) you really just want the Path of each matching file (not the matching line, etc.).

I tried reading each file only once and then iterating through the regexes. There is some gain (it's faster than the original solution), but the final result depends on other factors such as file sizes, the number of files, etc.

Also, removing 'ignorecase' makes it a little faster; a rough way to measure that on your own data is sketched after the code.

$res = @{}
$arr | % { $res[$_] = @() }

Get-ChildItem $location -recurse | 
  ? { !$_.PsIsContainer } |
  % { $file = $_
      $text = [Io.File]::ReadAllText($file.FullName)
      $arr | 
        % { $regex = $_
            if ([Regex]::IsMatch($text, $regex, 'ignorecase')) {
              $res[$regex] += $file.FullName
            }
        }
  }
$res.GetEnumerator() | % { 
  $key = $_.Key
  # wrap each path in an object so Export-Csv writes the path itself,
  # not just the string's Length property
  $_.Value | Select-Object @{ n = 'Path'; e = { $_ } } |
    Export-Csv "d:\temp\so-res$key.txt" -NoTypeInformation
}
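
To check the effect of case sensitivity on your own data, here is a rough micro-benchmark sketch reusing $location and $arr from above (numbers vary with file sizes and OS file caching, so run it twice and look at the second result):

# Compare case-insensitive vs. case-sensitive matching over the same files.
$files = Get-ChildItem $location -recurse | ? { !$_.PsIsContainer }

$ci = Measure-Command {
    foreach ($f in $files) {
        $text = [Io.File]::ReadAllText($f.FullName)
        foreach ($regex in $arr) { $null = [Regex]::IsMatch($text, $regex, 'ignorecase') }
    }
}
$cs = Measure-Command {
    foreach ($f in $files) {
        $text = [Io.File]::ReadAllText($f.FullName)
        foreach ($regex in $arr) { $null = [Regex]::IsMatch($text, $regex) }
    }
}
'ignorecase: {0:N1}s  case-sensitive: {1:N1}s' -f $ci.TotalSeconds, $cs.TotalSeconds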
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow