Question

I want to know if there is a better way of extracting info from a web page than parsing the HTML for what i'm searching. ie: Extracting movie rating from 'imdb.com'

I'm currently using the IndyHttp components to get the page and i'm using strUtils to parse the text but the content is limited.

Was it helpful?

Solution

I found plain simple regex-es to be highly intuitive and simple when dealing with good web-sites, and IMDB is a good web site.

For example the movie rating on the IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:

star-box-giga-star[^>]*>([^<]*)<

It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class id, then it looks for the > that terminates the DIV, and then captures everything until the following <. To create a new regex like this you should use a web browser that allows inspecting elements (for example Crome or Opera). With Chrome you can simply look at the web-page, right-click on the element you want to capture and do Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easily identifiable! You'll usually have no problem finding such identifiable elements on good web sites because good web sites use CSS and CSS requires ID's or class'es to be able to style the elements properly.

OTHER TIPS

Processing RSS feed is more comfortable.

As of the time of posting, the only RSS feeds available on the site are:

  • Born on this Date
  • Died on this Date
  • Daily Poll

Yet, you may make a call for adding a new one by getting in touch with the help desk.

Resources on RSS feed processing:

When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.

Therefore, you should always try to use a supported API Or RSS feed, or at least get permission from the web site to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple legal fronts (Denial of Service and Intellectual Property).

Here's IMDB's statement:

You may not use data mining, robots, screen scraping, or similar online data gathering and extraction tools on our website.

To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.

Use HTML Tidy to convert any HTML to valid XML and then use an XML parser, maybe using XPATH or developing your own code (which is what I do).

All the answers posted cover well your generic question. I usually follow an strategy similar to the one detailed by Cosmin. I use wininet and regex for most of my web extraction needs.

But let me add my two cents at the specific subquestion on extracting imdb qualification. IMDBAPI.COM provides a query interface returning json code, which is very handy for this type of searches.

So a very simple command line program for getting a imdb rating would be...

program imdbrating;
{$apptype console}
uses htmlutils;

function ExtractJsonParm(parm,h:string):string;
 var r:integer;
 begin
  r:=pos('"'+Parm+'":',h);
  if r<>0 then 
    result:=copy(h,r+length(Parm)+4,pos(',',copy(h,r+length(Parm)+4,length(h)))-2)
  else
    result:='N/A';
 end;
    
var h:string;
begin
  h:=HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  writeln(ExtractJsonParm('Rating',h));
end.

If the page you are crawling is valid XML, i use SimpleXML to extract infos. Works pretty well.

Resource:

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top