HTML parsing with Perl LWP gives incorrect results

Question

Update

My apologies. After looking further I have found that IMDb uses the Accept-Language header of the HTTP request to determine how to render the page, so JavaScript is not the cause as my original answer claims. By default LWP doesn't send this header at all, but Firefox does, which is why the WWW::Mechanize::Firefox solution in my original answer works correctly.

So a solution using only LWP is possible. One way is to build a tailored request as an HTTP::Request object and pass it to an LWP::UserAgent object using the request method.

This code demonstrates the approach.

use strict;
use warnings;

use feature 'say';

use LWP::UserAgent;
use HTTP::Request;
use HTML::TreeBuilder::XPath;

my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';

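my $ua = LWP::UserAgent->new;

# IMDb uses Accept-Language to decide how to render the page; LWP omits it by default
my $req = HTTP::Request->new(GET => $url, ['Accept-Language' => 'en-gb,en']);
my $resp = $ua->request($req);
die 'Request failed: ', $resp->status_line unless $resp->is_success;

# Collect the link text of every search result
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my @results = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');

say $results[0];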

The output is the same as that of the program in my original answer.
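If you would rather not construct an HTTP::Request by hand, LWP::UserAgent can also attach the header itself. The sketch below has not been run against IMDb. It relies only on the documented default_header method and on the header arguments accepted by get. The printed line just confirms what the agent is configured to send, and the per-request call is commented out because it needs network access.

```perl
use strict;
use warnings;
use feature 'say';

use LWP::UserAgent;

my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';

my $ua = LWP::UserAgent->new;

# Option 1: send the header with every request made through this agent
$ua->default_header('Accept-Language' => 'en-gb,en');

# Confirm the configured value without touching the network
say $ua->default_header('Accept-Language');

# Option 2 (not executed here): pass the header for a single request only
# my $resp = $ua->get($url, 'Accept-Language' => 'en-gb,en');
```

Setting a default header is convenient when several pages are fetched from the same site, while the per-request form keeps the header local to one call.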

Original Answer

The problem is that the content you are seeing in your browser is generated by JavaScript code after the page has loaded. The simple combination of LWP and HTML::TreeBuilder cannot process anything other than the raw HTML returned by the site.

The usually recommended solution is the WWW::Mechanize::Firefox module, which uses a live Firefox process to fetch the HTML and JavaScript and render the page. Note that it requires the Firefox browser to be installed on your machine, and the MozRepl Firefox add-on must be installed and running.

This program shows working code that returns the result you expect. Note that I have also used HTML::TreeBuilder::XPath instead of the bare HTML::TreeBuilder, because it allows a much simpler expression of the parts of the HTML you are interested in.

use strict;
use warnings;

use feature 'say';

use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::XPath;

my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';

# Drives a live Firefox through the MozRepl add-on, which must be running
my $mech = WWW::Mechanize::Firefox->new;
$mech->get($url);

my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->response->content);
my @results = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');

say $results[0];

Output

A Man, a Bear
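To see what the XPath expression picks out without depending on the live site, here is a small sketch run against a made-up HTML fragment. The markup is only my assumption about the general shape of a results table: the result_text class comes from the code above, while the second title and the href values are placeholders.

```perl
use strict;
use warnings;
use feature 'say';

use HTML::TreeBuilder::XPath;

# Hypothetical markup: only the "result_text" class is taken from the answer
my $html = <<'END_HTML';
<table>
  <tr><td class="result_text"><a href="#">A Man, a Bear</a> (extra text)</td></tr>
  <tr><td class="result_text"><a href="#">Placeholder Title</a></td></tr>
</table>
END_HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select only the text inside the <a> of each result cell
my @results = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');

say for @results;

# Release the parse tree once the strings have been extracted
$tree->delete;
```

Text outside the link, such as the parenthesised note in the first cell, is not returned because the expression stops at a/text().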