Question

So far I am using curl along with w3m and sed to extract portions of a webpage, like <body>....content....</body>. I want to ignore all the other tags (e.g. <a></a>, <div></div>). The problem is that the way I am doing it right now is really slow.

curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' > file.html 
w3m -dump file.html > file2.txt

These two lines above are really slow because curl has to first save the whole webpage into a file and sed has to parse it, then w3m parses it and saves it into another file. I just want to simplify this code. I was wondering if there is a way with lynx or html2text to extract webpage content between specified tags. So, for example, if I wanted to extract something from a webpage (www.badexample.com <--- not actually the link) with this content:

<title>blah......blah...</title>
            <body>
                 Some text I need to extract
            </body>
 more stuffs

Is there a program where I can specify the tags whose content I want to extract? So I would run someprogram <body></body> www.badexample.com and it would extract only the content inside those tags?
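
As an aside on the two commands above: they can usually be chained into a single pipeline, so nothing is written to an intermediate file. This is only a sketch of the same curl | sed | w3m approach, and it assumes the local w3m build accepts HTML on standard input when given -T text/html:

curl -sL "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' | w3m -dump -T text/html > file2.txt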

Solution

You can use a Perl one-liner for this:

perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' http://www.example.com/ title
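
If you want the someprogram <tag> url interface described in the question, one option is to wrap the one-liner in a small shell function. This is only a sketch; the function name extract_tag is my own choice, and the pattern only matches bare tags without attributes:

extract_tag() {
  # $1 = tag name (e.g. body), $2 = URL
  perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' "$2" "$1"
}

extract_tag body http://www.example.com/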

Instead of the tag name, you can pass a whole regex as well:

perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.example.com/" "<body>(.*?)</body>"
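
For instance, the boundaries used by the sed expression in the question could be expressed as a regex. The pattern below is only an illustration of that idea, not tested against the real page; note the capture group, which is what actually gets printed:

perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.somewebpage.com" '(<article class=.*?<div id="below">)'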

OTHER TIPS

Must it be in bash? What about PHP and DOMDocument?

<?php
// Fetch the page and copy only the children of <body> into a fresh document.
$dom = new DOMDocument();
$new_dom = new DOMDocument();

$url_value = 'http://www.google.com';
$html = file_get_contents($url_value);

// Real-world HTML is rarely well formed; keep libxml from emitting warnings.
libxml_use_internal_errors(true);
$dom->loadHTML($html);

// The first (and normally only) <body> element of the page.
$body = $dom->getElementsByTagName('body')->item(0);

// Deep-copy each child of <body> into the new document.
foreach ($body->childNodes as $child) {
    $new_dom->appendChild($new_dom->importNode($child, true));
}

echo $new_dom->saveHTML();
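
If the snippet is saved to a file (for example extract-body.php; the name is arbitrary), it can replace the curl/sed step and be piped through w3m for a plain-text dump, again assuming w3m accepts HTML on stdin with -T text/html:

php extract-body.php | w3m -dump -T text/html > file2.txt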
Licensed under: CC-BY-SA with attribution