Question

I am trying to scrape the HTML of http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all. The result set contains a single result, inside an element with the class result_text. I find the link inside that element and take its text, which Firebug shows as A Man, a Bear. Strangely, though, the following code prints Yek mard, yek khers. How can I get the text that I see in the browser?

use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;
use URI;
use URI::Escape;
use HTML::TreeBuilder;

my $ua   = LWP::UserAgent->new;
my $name = "Yek mard, yek khers";
my $uri  = URI->new("http://www.imdb.com/find?q=".uri_escape($name)."&s=all");
my $response = $ua->get( $uri );

my $root = HTML::TreeBuilder->new_from_content($response->decoded_content);
my @results = $root->find_by_attribute("class","result_text");
my $link = $results[0]->find_by_tag_name("a");
say $link->as_HTML();
# This should print <a href="/title/tt0122857/?ref_=fn_al_tt_1">A Man, a Bear</a>
# but prints <a href="/title/tt0122857/?ref_=fn_al_tt_1">Yek mard, yek khers</a>

Solution

Update

My apologies. After looking further I have found that IMDb uses the Accept-Language header of the HTTP request to determine how to render the page, so the difference is not caused by JavaScript as my original answer (below) claims. By default LWP doesn't send this header at all, but Firefox does, which is why my original solution below works correctly.

So a solution using only LWP is possible. One way is to build a tailored request as an HTTP::Request object and pass it to a LWP::UserAgent object using the request method.

This code demonstrates the approach.

use strict;
use warnings;

use feature 'say';

use LWP;
use HTML::TreeBuilder::XPath;

my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';

my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => $url, ['Accept-Language' => 'en-gb,en']);
my $resp = $ua->request($req);

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my @results = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');

say $results[0];

The output is the same as that of the original answer below: A Man, a Bear.
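
As a rough offline check of the header behaviour described in this update, the sketch below uses the prepare_request method of LWP::UserAgent to look at the headers a request would carry, without sending anything. It assumes a stock LWP::UserAgent with no extra configuration; the variable names $before and $after are only for illustration.

```perl
use strict;
use warnings;
use feature 'say';

use LWP::UserAgent;
use HTTP::Request;

my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';

# A fresh user agent has no Accept-Language among its default headers.
# prepare_request only fills in the agent's defaults; nothing is sent.
my $ua     = LWP::UserAgent->new;
my $before = $ua->prepare_request(HTTP::Request->new(GET => $url));
say 'before: ', $before->header('Accept-Language') // '(not set)';

# Register the header once on the user agent (assumed language list,
# copied from the code above); it is then added to every request.
$ua->default_header('Accept-Language' => 'en-gb,en');
my $after = $ua->prepare_request(HTTP::Request->new(GET => $url));
say 'after:  ', $after->header('Accept-Language');
```

With the default header in place, a plain $ua->get($url) should carry it as well. The get method also accepts header name/value pairs after the URL, as in $ua->get($url, 'Accept-Language' => 'en-gb,en'), which may be the shortest option when only one request needs the header.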


Original Answer

The problem is that the content you are seeing in your browser is generated by JavaScript code after the page has loaded. The simple combination of LWP and HTML::TreeBuilder cannot process anything other than the raw HTML returned by the site.

The usual solution recommended for this is to use the WWW::Mechanize::Firefox module, which uses a live Firefox process to fetch the HTML and JavaScript and render the page. Note that it requires the Firefox browser to be installed on your machine, and the MozRepl Firefox addon must be installed and running.

This program shows working code that returns the result you expect. Note that I have also used HTML::TreeBuilder::XPath instead of the bare HTML::TreeBuilder, which allows a much simpler expression of the parts of the HTML you are interested in.

use strict;
use warnings;

use feature 'say';

use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::XPath;

my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';

my $mech = WWW::Mechanize::Firefox->new;
$mech->get($url);

my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->response->content);
my @results = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');

say $results[0];

output

A Man, a Bear
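
The XPath expression used in both programs can also be tried without any network access. The sketch below runs it against a small hand-written HTML fragment; the fragment is a made-up stand-in for one row of the results table, not a copy of the real IMDb markup, and the variable name @titles is only for illustration.

```perl
use strict;
use warnings;
use feature 'say';

use HTML::TreeBuilder::XPath;

# Made-up stand-in for one row of the search results table.
my $html = <<'HTML';
<table>
  <tr>
    <td class="result_text"><a href="/title/tt0122857/?ref_=fn_al_tt_1">A Man, a Bear</a></td>
  </tr>
</table>
HTML

# Select the text of every link that sits directly inside a result_text cell.
my $tree   = HTML::TreeBuilder::XPath->new_from_content($html);
my @titles = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');

say $titles[0];

$tree->delete;    # free the parse tree when done
```

A single XPath expression replaces the find_by_attribute and find_by_tag_name pair from the question, and it returns the link text directly rather than an element that still has to be serialised.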
Licensed under: CC-BY-SA with attribution