Question

I am reading a flat file (fixed-width fields) so that I can import it into SQL Server. This is a new project in my company, and I have not yet worked on parsing huge files in a time-efficient way. Searching the Internet, I found a PowerShell script that can parse a fixed-width file, and it does the job.

[void][System.Reflection.Assembly]::LoadWithPartialName('Microsoft.VisualBasic')

$Parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser('C:\T1046436.dat')

$Parser.TextFieldType = 'FixedWidth'
$Parser.TrimWhiteSpace = $False
$Parser.FieldWidths = @(13,6,11,2,10,10,1,1,7,7,2,17,1,2,2,4,1,10,10,10,10,2,10,10,11,2,2,1,1)

while (!$Parser.EndOfData)
{
    try
    {
        Write-Host $Parser.ReadFields()
    }
    catch [Microsoft.VisualBasic.FileIO.MalformedLineException]
    {
        Write-Host "Error, line $($_.Exception.LineNumber): $($Parser.ErrorLine)"
    }
}

I want to be able to save this to a pipe-delimited file, so that I can simply BCP it into the SQL database. That's my quick-and-dirty way to load data. But it takes a lot of time (20 minutes for 50,000 records).
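The BCP step itself is nothing special, just a bcp in with a pipe field terminator, something along these lines (the server, database, table, and output file names below are placeholders, not my real ones):

#Placeholder names; -c = character mode, -t = field terminator, -T = trusted connection
bcp MyDatabase.dbo.MyTable in C:\T1046436.psv -S MyServer -T -c -t "|"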

Any suggestions for doing this quickly and efficiently:

  • convert it into a pipe-delimited file, OR
  • directly import the flat file into SQL Server from PowerShell, OR
  • I am open to using any other scripting language (Python, Perl, etc.) that can quickly parse the flat file into a pipe-delimited file under Windows. Any example scripts would be appreciated.

Solution

This PowerShell script will parse a collection of files ($files), each containing records of 5 fields of 10 characters apiece, and output those records with the fields pipe-delimited.

#Create a regular expression to match the field widths and capture the data.
$regex = [regex]'(.{10})(.{10})(.{10})(.{10})(.{10})'

#Create a filter to insert a pipe character between the captured groups.
filter PipeDelimit {$_ -replace $regex, '$1|$2|$3|$4|$5'}

#Pipe the records through the filter in batches of 1000
Get-Content $files -ReadCount 1000 | PipeDelimit

You'll need to modify the regex and filter to match your data. I suspect it will take considerably less than 20 minutes to chew through half a million of those records.
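As a rough, untested sketch of that adaptation: the pattern and the replacement string can be generated from the 29 field widths given in the question rather than typed out by hand, and the result written straight to a pipe-delimited file (the .psv output path is just a placeholder):

#Field widths copied from the question
$widths = 13,6,11,2,10,10,1,1,7,7,2,17,1,2,2,4,1,10,10,10,10,2,10,10,11,2,2,1,1

#Build the pattern '(.{13})(.{6})...' and the replacement '$1|$2|...|$29'
$regex = [regex](-join ($widths | ForEach-Object { "(.{$_})" }))
$replacement = (1..$widths.Count | ForEach-Object { "`$$_" }) -join '|'

filter PipeDelimit { $_ -replace $regex, $replacement }

#Convert and write the pipe-delimited output in one pass
Get-Content 'C:\T1046436.dat' -ReadCount 1000 |
    PipeDelimit |
    Set-Content 'C:\T1046436.psv'

Set-Content accepts the 1000-line batches as they come down the pipeline, so the memory behaviour described below is preserved.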

The -ReadCount parameter controls memory usage by keeping only 1000 records at a time in the pipeline. They are passed down the pipeline as an array, and the -replace operator in the filter delimits the entire array in one operation, without needing to foreach through each record. The filter is admittedly unusual and could be replaced with ForEach-Object, but the filter is marginally faster, and that adds up when you're doing lots of reps.
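For comparison, the ForEach-Object version alluded to above would be the following one-liner; it produces the same output, just with slightly more overhead per batch:

Get-Content $files -ReadCount 1000 | ForEach-Object { $_ -replace $regex, '$1|$2|$3|$4|$5' }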

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow