Question

I am working on a Perl script to read a CSV file and do some calculations. The CSV file has only two columns, something like the example below.

One Two
1.00 44.000
3.00 55.000

This CSV file can be very big, anywhere from 10 MB to 2 GB.

Currently I am working with a 700 MB CSV file. I tried to open it in Notepad and Excel, but it looks like no software is going to open it.

I want to read maybe the last 1000 lines of the CSV file and see the values. How can I do that? I cannot open the file in Notepad or any other program.

If I write a Perl script, then I need to process the complete file to reach the end and then read the last 1000 lines.

Is there any better way to do that? I am new to Perl and any suggestions will be appreciated.

I have searched the net and there are some scripts available like File::Tail, but I don't know whether they will work on Windows.

Solution

In *nix, you can use the tail command.

tail -n 1000 yourfile | perl ...

That will write only the last 1000 lines to the perl program.

On Windows, the gnuwin32 and unxutils packages both include a tail utility.

OTHER TIPS

The File::ReadBackwards module allows you to read a file in reverse order. This makes it easy to get the last N lines as long as you aren't order dependent. If you are and the needed data is small enough (which it should be in your case) you could read the last 1000 lines into an array and then reverse it.
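
As a minimal sketch of that approach: the helper name last_n_lines is made up for illustration, and File::ReadBackwards is a CPAN module that has to be installed first. It reads lines from the end of the file until it has enough, then reverses them back into file order.

```perl
use strict;
use warnings;
use File::ReadBackwards;   # CPAN module, not part of core Perl

# Hypothetical helper: return the last $n lines of $file, in file order.
sub last_n_lines {
    my ($file, $n) = @_;
    my $bw = File::ReadBackwards->new($file)
        or die "Can't read $file: $!\n";
    my @lines;
    while (@lines < $n) {
        my $line = $bw->readline;      # undef once the start of the file is reached
        last unless defined $line;
        push @lines, $line;
    }
    $bw->close;
    @lines = reverse @lines;           # lines were collected last-first
    return @lines;
}
```

Calling print last_n_lines('bigfile.csv', 1000); would then print the tail of the file without reading the rest of it.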

This is only tangentially related to your main question, but when you want to check whether a module such as File::Tail works on your platform, check the results from CPAN Testers. The links at the top of the module page in CPAN Search lead you there:

[image: file-tail-header]

Looking at the matrix, you see that this module does indeed have a problem on Windows on all versions of Perl tested:

[image: file-tail-matrix]

I wrote a quick backward file reader in pure Perl:

#!/usr/bin/perl
use warnings;
use strict;
my ($file, $num_of_lines) = @ARGV;

my $count = 0;
my $filesize = -s $file; # file size, used to detect reaching the start of the file while reading backward
my $offset = -2; # start at the second-to-last byte, skipping the final \n of the file

open my $fh, '<', $file or die "Can't read $file: $!\n";

while ($count < $num_of_lines && abs($offset) <= $filesize) {
    my $line = "";
    # we need to check for the start of the file ourselves: in mode "2" a seek
    # past the start fails, and getc would keep returning data anyway
    while (abs($offset) <= $filesize) {
        seek $fh, $offset, 2;   # negative $offset with whence "2" seeks backward from the end
        $offset -= 1;           # move back the counter
        my $char = getc $fh;
        last if $char eq "\n"; # hit the end of the previous line, so the current line is complete
        $line = $char . $line; # otherwise prepend the character to the current line
    }

    # got the next line! (lines are printed last to first)
    print $line, "\n";
    $count++;
}
close $fh;

Run the script like this:

$ get-x-lines-from-end.pl ./myhugefile.log 200
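
The script above seeks once per character, which can be slow for long lines. A possible variation, sketched here under assumptions (the helper name tail_lines and the 64 KB default block size are arbitrary choices, not part of the original answer), reads fixed-size blocks from the end of the file until it has seen enough newlines, then splits the buffer into lines.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $n lines of $file, in file order.
sub tail_lines {
    my ($file, $n, $chunk) = @_;
    $chunk ||= 64 * 1024;              # block size is an arbitrary choice
    return () if $n <= 0;
    open my $fh, '<:raw', $file or die "Can't read $file: $!\n";
    my $pos = -s $fh;
    my $buf = '';
    # Prepend blocks until the buffer holds more than $n newlines (one extra,
    # so the first kept line is complete) or the start of the file is reached.
    while ($pos > 0 && ($buf =~ tr/\n//) <= $n) {
        my $len = $pos < $chunk ? $pos : $chunk;
        $pos -= $len;
        seek $fh, $pos, 0 or die "seek failed: $!\n";
        read($fh, my $block, $len) == $len or die "short read on $file\n";
        $buf = $block . $buf;
    }
    close $fh;
    my @lines = split /^/, $buf;       # split into lines, keeping line endings
    splice @lines, 0, @lines - $n if @lines > $n;
    return @lines;
}
```

Unlike the script above, this returns the lines in their original order, and a file with fewer than $n lines is simply returned whole.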

Without tail, a Perl-only solution isn't that unreasonable.

One way is to seek from the end of the file, then read lines from it. If you don't have enough lines, seek even further from the end and try again.

sub last_x_lines {
    my ($filename, $lineswanted) = @_;
    my ($line, $filesize, $seekpos, $numread, @lines);

    open my $fh, '<', $filename or die "Can't read $filename: $!\n";

    $filesize = -s $filename;
    $seekpos = 50 * $lineswanted; # first guess: about 50 bytes per line
    $seekpos = $filesize if $seekpos > $filesize; # never seek to before the start of the file
    $numread = 0;

    while ($numread < $lineswanted) {
        @lines = ();
        $numread = 0;
        seek($fh, $filesize - $seekpos, 0);
        <$fh> if $seekpos < $filesize; # Discard probably fragmentary line
        while (defined($line = <$fh>)) {
            push @lines, $line;
            shift @lines if ++$numread > $lineswanted;
        }
        if ($numread < $lineswanted) {
            # We didn't get enough lines. Double the amount of space to read from next time.
            if ($seekpos >= $filesize) {
                die "There aren't even $lineswanted lines in $filename - I got $numread\n";
            }
            $seekpos *= 2;
            $seekpos = $filesize if $seekpos >= $filesize;
        }
    }
    close $fh;
    return @lines;
}

P.S. A better title would be something like "Reading lines from the end of a large file in Perl".

perl -n -e "shift @d if (@d >= 1000); push(@d, $_); END { print @d }" < bigfile.csv

Although really, the fact that UNIX systems can simply run tail -n 1000 should convince you to install Cygwin or coLinux.

You could use the Tie::File module, I believe. It looks like this presents the lines of the file as an array, so you could get the size of the array and process elements arraySize-1000 up to arraySize-1.

Tie::File
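
A minimal sketch of the Tie::File idea, assuming a hypothetical helper named last_lines_tied. Tie::File ships with core Perl and does not hold the whole file in memory, but asking for the array size makes it scan the file to find where each line starts, so this is still a full pass over a large file.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;

# Hypothetical helper: return the last $n lines of $file (line endings removed).
sub last_lines_tied {
    my ($file, $n) = @_;
    tie my @lines, 'Tie::File', $file, mode => O_RDONLY
        or die "Can't tie $file: $!\n";
    my $total = @lines;                          # forces a scan to count the lines
    my $first = $total > $n ? $total - $n : 0;   # arraySize - 1000, clamped at 0
    my @tail  = $first < $total ? @lines[$first .. $total - 1] : ();
    untie @lines;
    return @tail;
}
```

Note that Tie::File strips the line ending from each record by default, so the caller has to add "\n" back when printing.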

Another option would be to count the number of lines in the file, then loop through the file once and start reading values at line numberOfLines-1000.

$count = `wc -l < $file`;
die "wc failed: $?" if $?;
chomp($count);

That would give you the number of lines (on most systems).

If you know the number of lines in the file, you can do

perl -ne "print if ($. > N);" filename.csv

where N is $num_lines_in_file - $num_lines_to_print. You can count the lines with

perl -e "while (<>) {} print $.;" filename.csv
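
The two one-liners can be combined into a single script. This is only a sketch: the helper name last_lines_two_pass is hypothetical, and it keeps its own line counter rather than relying on $. across the rewind. It still reads the whole file twice, so it is simple but not fast.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $n lines of $file using two sequential passes.
sub last_lines_two_pass {
    my ($file, $n) = @_;
    open my $fh, '<', $file or die "Can't read $file: $!\n";

    my $total = 0;
    while (defined(my $line = <$fh>)) {   # pass 1: count the lines
        $total++;
    }
    my $skip = $total - $n;               # the N used in the one-liner above

    seek $fh, 0, 0 or die "seek failed: $!\n";   # rewind for pass 2
    my $lineno = 0;
    my @lines;
    while (defined(my $line = <$fh>)) {   # pass 2: keep lines numbered above $skip
        push @lines, $line if ++$lineno > $skip;
    }
    close $fh;
    return @lines;
}
```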

The modules are the way to go. However, sometimes you may be writing a piece of code that you want to run on a variety of machines that may be missing the more obscure CPAN modules. In that case why not just 'tail' and dump the output to a temp file from within Perl?

#!/usr/bin/perl
use strict;
use warnings;

system('tail --lines=1000 /path/myfile.txt > tempfile.txt') == 0
    or die "tail failed: $?\n";

You then have something that isn't dependent on a CPAN module if installing one may present an issue.
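
If the temp file is not needed, tail's output can also be read straight from a pipe. The sketch below assumes a tail executable is on the PATH and uses a hypothetical helper name, tail_via_pipe. On Windows, whether the list form of a piped open is available depends on the Perl version, so treat that as something to verify.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $n lines of $file as reported by tail.
sub tail_via_pipe {
    my ($file, $n) = @_;
    # List-form pipe open runs tail directly: no shell quoting, no temp file.
    open my $pipe, '-|', 'tail', '-n', $n, $file
        or die "Can't run tail: $!\n";
    my @lines = <$pipe>;
    # close returns false if tail exited with a non-zero status.
    close $pipe or die "tail failed (exit status $?)\n";
    return @lines;
}
```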

Without relying on tail (which I probably would do), if you have more than $FILESIZE [2GB?] of memory then I'd just be lazy and do:

my @lines = <>;
my @lastKlines = @lines[-1000 .. -1];

Though the other answers involving tail or seek() are pretty much the way to go on this.

You should absolutely use File::Tail or, better yet, another module. It's not a script, it's a module (a programming library). It likely works on Windows. As somebody said, you can check this on CPAN Testers, or often just by reading the module documentation or simply trying it.

You selected usage of the tail utility as your preferred answer, but that's likely to be more of a headache on Windows than File::Tail.

Licensed under: CC-BY-SA with attribution