Question

I'm passing downloaded HTML to STDIN and then stripping every tag except the table markup. I want to render the tables from the remaining table, tr, and td tags so that they end up "\t" or "|" delimited. ASCII-formatted tables would also work. The following is what I have so far, but it doesn't get the job done:

#!/usr/bin/perl -ws
use HTML::Scrubber;
use HTML::Entities qw(decode_entities);
use Text::Unidecode qw(unidecode);

my $HTMLinput = do {local $/; <STDIN>};

my $scrubber = HTML::Scrubber->new( allow => [ qw[ table tr td ] ] );

#this prints the text from the page, but without formatting tables in ASCII:
#print $scrubber->scrub($HTMLinput);

my $scrubber2 = $scrubber->scrub($HTMLinput);

#was hoping this would transform table, tr, and td-tagged content
#into ASCII-formatted tables, but it doesn't work:
print unidecode(decode_entities($scrubber2)), "\n";

#test page: http://www.w3schools.com/html/html_tables.asp
#curl http://www.w3schools.com/html/html_tables.asp | html.table.parser.pl 

Solution

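Here is the solution I arrived at, thanks in part to user tjd. Instead of fetching raw HTML with curl, the page is rendered with links -dump, which lays the tables out as plain text before the script sees them: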

#!/usr/bin/perl -ws
use HTML::Scrubber;

my $HTMLinput = do {local $/; <STDIN>};
my $scrubber = HTML::Scrubber->new( allow => [ qw[ table tr td ] ] );
print $scrubber->scrub($HTMLinput);

#test page: http://www.w3schools.com/html/html_tables.asp
#links -dump http://www.w3schools.com/html/html_tables.asp | html.table.parser.pl 

#needed: the "links" text-mode browser (sudo yum install links)
#http://www.jikos.cz/~mikulas/links/
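
If installing links is not an option, the scrubbed markup could also be converted directly. The following is only a rough sketch and not part of the original answer: tables_to_delimited is a hypothetical helper, the sample string stands in for $scrubber->scrub($HTMLinput), and the regexes assume well-formed, non-nested tables. It uses core Perl only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Scrubber): turn markup that has been
# reduced to <table>, <tr>, and <td> tags into delimited text, one row per line.
sub tables_to_delimited {
    my ($html, $sep) = @_;
    $sep = "\t" unless defined $sep;
    my $out = '';
    # Assumption: tables are not nested, so a non-greedy match is enough.
    while ($html =~ m{<table\b[^>]*>(.*?)</table>}gis) {
        my $table = $1;
        while ($table =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
            my $row = $1;
            my @cells;
            while ($row =~ m{<td\b[^>]*>(.*?)</td>}gis) {
                my $cell = $1;
                $cell =~ s/\s+/ /g;    # collapse newlines and runs of spaces
                $cell =~ s/^ //;       # trim the leading space, if any
                $cell =~ s/ $//;       # trim the trailing space, if any
                push @cells, $cell;
            }
            $out .= join($sep, @cells) . "\n" if @cells;
        }
        $out .= "\n";                  # blank line between tables
    }
    return $out;
}

# Sample input standing in for the output of $scrubber->scrub($HTMLinput).
my $scrubbed = '<table><tr><td>Jill</td><td>Smith</td><td>50</td></tr>'
             . '<tr><td>Eve</td><td>Jackson</td><td>94</td></tr></table>';

print tables_to_delimited($scrubbed, "|");
```

Text outside td cells is ignored. Because the allow list above does not include th, header text is left outside any cell and is dropped too; adding th to the allow list and to the cell regex would keep it.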

Other tips

I'd hate to reinvent the wheel of drawing tables in text. I'd either pipe the output of a text browser such as links or w3m to a file or stdout, or use a module like Text::Table to do the heavy lifting.
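
As a minimal sketch of the Text::Table route, assuming the module is installed from CPAN and that the cell values have already been extracted into rows (the column titles and data below are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;

# Column titles are plain strings; scalar references such as \' | ' are
# treated by Text::Table as separators placed between the columns.
my $tb = Text::Table->new('First', \' | ', 'Last', \' | ', 'Age');

# Illustrative rows only; in practice these would come from the parsed tables.
$tb->load(
    [ 'Jill', 'Smith',   50 ],
    [ 'Eve',  'Jackson', 94 ],
);

# The table object stringifies to the aligned text table.
print $tb;
```

Text::Table takes care of column widths and alignment, so the script only has to supply the rows.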

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow