Question

I am trying to split a huge text file (~500 million lines of text) which is pretty regular and looks like this:

-- Start --

blah blah

-- End --

-- Start --

blah blah

-- End --

...

where ... indicates that the pattern repeats and "blah blah" is of variable length, roughly 2000 lines. I want to split off the first

-- Start --

blah blah

-- End --

block into a separate file and delete it from the original file in the FASTEST (runtime, given I will run this MANY times) possible way.

The ideal solution would cut the initial block from the original file and paste it into the new file without loading the tail of the huge initial file.

I attempted csplit in the following way:

csplit file.txt /End/+1 

which works, but is slow.

EDIT: Is there a solution if we remove the last "start-end" block from file instead of the first one?


Solution

If you want the beginning removed from the original file, you have no choice but to read and write the whole rest of the file. Removing the end (as you suggest in your edit) can be much more efficient:

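use strict;
use warnings;
use File::ReadBackwards;
use File::Slurp 'write_file';

# Read records from the end of the file, using the end marker as separator.
my $fh = File::ReadBackwards->new( 'inputfile', "-- End --\n" )
    or die "couldn't read inputfile: $!\n";
# The first record read backwards is the last block in the file.
my $last_chunk = $fh->readline
    or die "file was empty\n";
# Byte offset at which that last block begins.
my $position = $fh->tell;
$fh->close;
# Save the block first, so a failed write cannot lose data.
write_file( 'lastchunk', $last_chunk );
# Cut the block off in place; nothing before it is read or rewritten.
truncate( 'inputfile', $position )
    or die "couldn't truncate inputfile: $!\n";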

Other suggestions

Perhaps something like the following will help you:

Split the file after every -- End -- marker. Create new files with a simple incremented suffix.

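use strict;
use warnings;
use autodie;

my $file = shift;

my $i = 0;
my $fh;

open my $infh, '<', $file;

while (<$infh>) {
    # Open the next numbered output file at the start of each block.
    open $fh, '>', $file . '.' . ++$i if !$fh;
    print $fh $_;
    # The match is case-sensitive and the file uses "-- End --".
    if (/^-- End --/) {
        close $fh;
        undef $fh;
    }
}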

Unfortunately, there is no truncate equivalent for removing data from the beginning of a file.

If you really wanted to do this in stages, I would suggest recording the last position you read from with tell, so you can seek back to it when you're ready to output another file.
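A minimal sketch of that staged approach, assuming the offset is kept in a small side file between runs. The helper name extract_next_block and the three path arguments are hypothetical, not something from the question. The big file is never rewritten; each call only reads the one block that starts at the saved offset.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: copy the next block from $in_path to $out_path,
# starting at the byte offset stored in $offset_path (0 if that file is absent).
# Returns 1 if a block was written, 0 if the input was already exhausted.
sub extract_next_block {
    my ($in_path, $out_path, $offset_path) = @_;

    # Load the offset saved by the previous run, if any.
    my $offset = 0;
    if (-e $offset_path) {
        open my $ofh, '<', $offset_path or die "open $offset_path: $!\n";
        my $saved = <$ofh>;
        close $ofh;
        if (defined $saved) {
            chomp $saved;
            $offset = $saved;
        }
    }

    open my $in, '<', $in_path or die "open $in_path: $!\n";
    seek($in, $offset, SEEK_SET) or die "seek $in_path: $!\n";

    # Copy lines up to and including the end marker.
    my $out;
    my $wrote = 0;
    while (my $line = <$in>) {
        if (!$out) {
            open $out, '>', $out_path or die "open $out_path: $!\n";
        }
        print {$out} $line;
        $wrote = 1;
        last if $line =~ /^-- End --/;
    }
    if ($out) {
        close $out or die "close $out_path: $!\n";
    }

    # Remember where this block ended so the next call can seek there.
    my $new_offset = tell($in);
    close $in;
    open my $ofh, '>', $offset_path or die "open $offset_path: $!\n";
    print {$ofh} "$new_offset\n";
    close $ofh or die "close $offset_path: $!\n";

    return $wrote;
}

# Example call (all three paths are placeholders):
# extract_next_block('file.txt', 'block.txt', 'file.txt.offset');
```

The trade-off is that the original file keeps its full size. The already-extracted prefix is simply skipped on later runs rather than deleted.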

You could use the flip-flop operator to get the content between these patterns:

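use strict;
use warnings;
use File::Slurp;
my @text = read_file( 'filename' ) ;
foreach my $line (@text){
  # Both sides of .. must test $line; a bare /End/ would match against $_ instead.
  if ($line =~ /Start/ .. $line =~ /End/) {
    # do stuff with $line
    print $line; # or so
  }
}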

When your file is large, be careful about slurping the whole file at once!
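For a file of this size, the same flip-flop test can be applied one line at a time instead. The sketch below assumes the markers start at the beginning of a line; the helper name print_blocks is hypothetical. It takes already-opened filehandles, so only one line is held in memory at any moment.

```perl
use strict;
use warnings;

# Hypothetical helper: copy every line that lies between a start marker and
# the following end marker (markers included) from $in_fh to $out_fh.
sub print_blocks {
    my ($in_fh, $out_fh) = @_;
    while (my $line = <$in_fh>) {
        # The range operator in scalar context acts as a flip-flop: it turns
        # true on the start marker and stays true through the end marker.
        if ($line =~ /^-- Start --/ .. $line =~ /^-- End --/) {
            print {$out_fh} $line;
        }
    }
    return;
}

# Example call (the path is a placeholder):
# open my $in, '<', 'filename' or die "open: $!\n";
# print_blocks($in, \*STDOUT);
```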

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow