Question

I am looking for a program that collects all of the text (the corpus) from a website and writes it to a single text file. This is the code I have right now:

#!/usr/bin/perl
use strict;
use CGI;
use Cwd;

print "Content-Type: text/html; charset=utf-8\n\n";

# Read the 'file' parameter from the CGI request.
my $q    = CGI->new;
my $file = $q->param('file');
chomp($file);
print "$file<br>";

# Mirror the site, rewriting links and skipping .gif images.
my $ftpname = "www.kuvempu.com";
system("wget --mirror -p --convert-links -x --reject=gif $ftpname");

But this only gives me the .html files of the website. How can I extract just the text from those files and write it all to a single text file?


Solution

You can do something like the following:

use strict;
use warnings;
use HTML::Strip;
use LWP::Simple qw/get/;

# Fetch the URL given on the command line and strip its HTML tags,
# leaving only the text content.
my $html = get(shift) or die "Unable to get web content.";
print HTML::Strip->new()->parse($html);

Command-line usage: perl script.pl http://www.website.com > outFile.txt

outFile.txt will then contain the text of that page. Note that this fetches a single URL at a time; to build one corpus from all of the .html files that wget has already mirrored, you can walk the mirror directory instead, as in the sketch below.
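Here is a minimal sketch of that variant. It walks the mirror directory, strips the HTML from each .html file with HTML::Strip, and appends the plain text to one output file. The directory name www.kuvempu.com (taken from the question's wget call) and the output name corpus.txt are illustrative assumptions, not part of the answer above.

#!/usr/bin/perl
# Sketch: walk the wget mirror, strip HTML from every .html/.htm file,
# and append the resulting plain text to a single corpus file.
use strict;
use warnings;
use File::Find;
use HTML::Strip;

my $mirror_dir = 'www.kuvempu.com';   # directory created by wget --mirror (assumed)
my $out_file   = 'corpus.txt';        # single output text file (assumed name)

open my $out, '>:encoding(UTF-8)', $out_file
    or die "Cannot open $out_file: $!";

find(sub {
    return unless /\.html?$/i;        # only process .html / .htm files

    open my $in, '<:encoding(UTF-8)', $_ or return;
    local $/;                         # slurp the whole file at once
    my $html = <$in>;
    close $in;

    my $hs = HTML::Strip->new();      # fresh stripper per file
    print {$out} $hs->parse($html), "\n";
    $hs->eof;
}, $mirror_dir);

close $out;

Run it from the directory where you ran wget, and corpus.txt will hold the text of every mirrored page.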

Hope this helps!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow