You can do something like the following:
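    use strict;
    use warnings;
    use HTML::Strip;
    use LWP::Simple qw/get/;
    # The URL to fetch is the first command-line argument.
    my $url = shift or die "Usage: perl $0 URL\n";
    # get() returns undef on failure, so test definedness rather than truth.
    my $html = get($url);
    die "Unable to get web content." unless defined $html;
    # Strip the HTML markup and print the remaining text.
    print HTML::Strip->new()->parse($html);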
Command-line usage: perl script.pl http://www.website.com > outFile.txt
outFile.txt will contain the text of the page at that URL, with the HTML tags stripped.
Hope this helps!