Question

Greetings glorious comrades,

Once again I need to maximize my solution power by borrowing your brilliance.

I have to use powershell to iterate through a directory of massive text files (20 GB in some cases), extracting file names, rowcounts, and creation dates, then outputting that information into a csv.

Here is my code so far:

$fileEntries = [IO.Directory]::GetFiles("T:\frg\working")
foreach ($fileName in $fileEntries)
{
    $count = 0
    $filedate = (Get-Date).Date
    $reader = New-Object IO.StreamReader $fileName
    while ($reader.ReadLine() -ne $null) { $count++ }
    $reader.Close()
    #Get-Content $fileName | % { $lines++ }
    [Console]::WriteLine($fileName + " " + $count + " " + $filedate)
}

The Get-Date is just a temporary filler until I can get the file creation date.

It currently outputs similar to:

T:\frg\working\file1.txt 90055 03/06/2014 00:00:00
T:\frg\working\file2.txt 6419616 03/06/2014 00:00:00

But for the life of me I can't pipe this to a csv successfully.

I tried setting up an object with custom attributes and outputting to that, but PowerShell complained that the pipeline was empty.

The immense size of the files rules out the Import-Csv option (pulling 20 GB into memory causes some issues). It would also be neat if I could filter by extension, but I can work around it if not.

Any pointers would be appreciated, thank you in advance.


Solution

Try this:

$fileEntries = [IO.Directory]::GetFiles("T:\frg\working")

$RecordCounts = foreach ($fileName in $fileEntries)
{
    $count = 0
    # Use the file's actual creation date rather than the Get-Date placeholder.
    $filedate = (Get-Item $fileName).CreationTime
    Get-Content $fileName -ReadCount 1000 |
        foreach { $count += $_.Count }

    New-Object psobject -Property @{ FileName = $fileName; Count = $count; FileDate = $filedate }
}

$RecordCounts | Export-Csv C:\somedir\RecordCounts.csv -NoTypeInformation
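The question also asked about filtering by extension; Get-ChildItem's -Filter parameter handles that in the same pass that yields the creation date. A sketch, assuming the same directory and a .txt filter:

```powershell
# Sketch: restrict the listing to *.txt and record each file's CreationTime.
# The directory and output path are the ones from the question; adjust as needed.
$RecordCounts = foreach ($file in Get-ChildItem T:\frg\working -Filter *.txt)
{
    $count = 0
    Get-Content $file.FullName -ReadCount 1000 |
        foreach { $count += $_.Count }

    New-Object psobject -Property @{
        FileName = $file.FullName
        Count    = $count
        FileDate = $file.CreationTime
    }
}
$RecordCounts | Export-Csv C:\somedir\RecordCounts.csv -NoTypeInformation
```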

Edit: Testing the three posted solutions against a 1 GB file of a little over 12 million lines:

$testfile = 'c:\testfiles\bigfile.txt'

'Get-Content | Measure-Object'
(measure-command {
Get-Content $testfile |
  Measure-Object -Line | select -expand Lines 
}).TotalSeconds
''

'StreamReader'
(measure-command {
$count=0
$reader = New-Object IO.StreamReader $testfile
while($reader.ReadLine() -ne $null){$count++}
$reader.close()
}).TotalSeconds
''

'Get-Content -ReadCount'
(measure-command {
$count=0
Get-Content $testfile -ReadCount 1000 |
  foreach {$count += $_.count}
}).TotalSeconds



Get-Content | Measure-Object
175.0600678

StreamReader
20.3832785

Get-Content -ReadCount
6.0199737

OTHER TIPS

This is how I would do it:

gci *.txt | % { 
    $lineCount = gc $_ | Measure-Object -Line | select -expand Lines
    select -InputObject $_ CreationTime, Name, @{Name="LineCount"; Expression={$lineCount}} 
    } | ConvertTo-Csv
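One caveat: ConvertTo-Csv only emits CSV text to the pipeline; to land the result in a file you can swap in Export-Csv. A minimal sketch (the out.csv path is just an example name):

```powershell
# Same idea, but writing to a file via Export-Csv instead of ConvertTo-Csv.
gci *.txt | % {
    $lineCount = gc $_.FullName | Measure-Object -Line | select -expand Lines
    $_ | select CreationTime, Name, @{Name = 'LineCount'; Expression = { $lineCount }}
} | Export-Csv .\out.csv -NoTypeInformation
```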

From my testing, gc (Get-Content) doesn't load the entire file into memory — it streams lines down the pipeline one at a time — so there is probably no need to write your own line counter.

I tested that in PS3. One of the text files was 13GB.
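If you're on PS3 (.NET 4) or later, another streaming option is the .NET [IO.File]::ReadLines method, which returns a lazy enumerator. A sketch, reusing a file path from the question:

```powershell
# [IO.File]::ReadLines yields lines lazily (requires .NET 4+),
# so the whole file is never held in memory at once.
$count = 0
foreach ($line in [IO.File]::ReadLines('T:\frg\working\file1.txt')) { $count++ }
$count
```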

This one is almost entirely your own code. I added $Output as an empty array outside the foreach loop. Inside the loop I create a custom object with the values you specified and add it to the array on each pass. Finally, after the loop, Export-Csv saves $Output to a CSV file. (Note that += rebuilds the array on every iteration; with a very large number of files, collecting the foreach output directly into a variable, as the accepted solution does, scales better.)

$fileEntries = [IO.Directory]::GetFiles("T:\frg\working")
$Output = @()
foreach ($fileName in $fileEntries)
{
    $count = 0
    $filedate = (Get-Date).Date
    $reader = New-Object IO.StreamReader $fileName
    while ($reader.ReadLine() -ne $null) { $count++ }
    $reader.Close()
    #Get-Content $fileName | % { $lines++ }
    [Console]::WriteLine($fileName + " " + $count + " " + $filedate)
    $Current = New-Object -TypeName PSObject -Property @{
        FileName = $fileName
        Count = $count
        FileDate = $filedate
    }
    $Output += $Current
}
$Output | Export-Csv C:\SomeFile.csv -NoTypeInformation
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow