Web scraping IMDb using R

https://stackoverflow.com/questions/22780363

25-06-2023

Question

I want to find the links to the top 250 movies on IMDb. I looked for a common pattern in the HTML source code and found "chttp", but I am not sure it will get me anywhere. How can I find a pattern to build the links from?

library(XML)
imdb <- "http://www.imdb.com/chart/top?sort=ir,desc"
imdb.page <- readLines(imdb)
# keep only the lines that contain "chttp"
g <- grep(pattern = "chttp", x = imdb.page)
imdb.lines <- imdb.page[g]

Here's an example output:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"

My main problem is working out the link (URL) for each of the top 250 movies from the code I have already written; I don't know what the next step is. I am also not sure whether "chttp" is a good pattern to grep for at all.

According to the results, starting from index 3 the movie titles are on the odd indices:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"
> imdb.lines[2]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0111161/?ref_=chttp_tt_1\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[3]
[1] "    <a href=\"/title/tt0111161/?ref_=chttp_tt_1\" title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>"
> imdb.lines[4]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0068646/?ref_=chttp_tt_2\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[5]
[1] "    <a href=\"/title/tt0068646/?ref_=chttp_tt_2\" title=\"Francis Ford Coppola (dir.), Marlon Brando, Al Pacino\" >The Godfather</a>"
> imdb.lines[6]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0071562/?ref_=chttp_tt_3\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[7]
[1] "    <a href=\"/title/tt0071562/?ref_=chttp_tt_3\" title=\"Francis Ford Coppola (dir.), Al Pacino, Robert De Niro\" >The Godfather: Part II</a>"
> imdb.lines[9]
[1] "    <a href=\"/title/tt0468569/?ref_=chttp_tt_4\" title=\"Christopher Nolan (dir.), Christian Bale, Heath Ledger\" >The Dark Knight</a>"
> imdb.lines[10]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0110912/?ref_=chttp_tt_5\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BMjE0ODk2NjczOV5BMl5BanBnXkFtZTYwNDQ0NDg4._V1_SY50_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
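
As a rough sketch of a possible next step, the "/title/tt.../" part of each matched line can be pulled out with a regular expression instead of relying on odd or even indices. The sample lines below are copied by hand from the output above (with the poster image URL shortened to a placeholder), and the names paths and movie.urls, as well as the "http://www.imdb.com" prefix, are assumptions made for illustration.

```r
# Sample lines taken from the output above; the page itself is not fetched here.
# The poster <img> src is shortened to a placeholder.
imdb.lines <- c(
  "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>",
  "  <td class=\"posterColumn\"><a href=\"/title/tt0111161/?ref_=chttp_tt_1\" ><img src=\"poster.jpg\" width=\"34\" height=\"50\" />",
  "    <a href=\"/title/tt0111161/?ref_=chttp_tt_1\" title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>",
  "    <a href=\"/title/tt0068646/?ref_=chttp_tt_2\" title=\"Francis Ford Coppola (dir.), Marlon Brando, Al Pacino\" >The Godfather</a>"
)

# Extract every "/title/tt<digits>/" path; lines without one contribute nothing.
paths <- unlist(regmatches(imdb.lines, gregexpr("/title/tt[0-9]+/", imdb.lines)))

# A film can appear twice (poster cell and title cell), so drop duplicates.
paths <- unique(paths)

# Assumed site prefix, used to turn relative paths into absolute URLs.
movie.urls <- paste0("http://www.imdb.com", paths)
print(movie.urls)
```

Because the match is on the path itself, the "IMDb Charts" line is skipped without any index arithmetic.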

Solution

XPath makes jobs like this trivial.

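library(XML)
# parse the chart page into an HTML document tree
tt <- htmlParse('http://www.imdb.com/chart/top?sort=ir,desc')
# column 1: link text (titles); remaining columns: the anchors' attributes
cbind(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlValue),
      t(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlAttrs)))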

The first argument to cbind returns the titles (the text between the a tags), and the second returns the anchors' attributes: href and title, the latter of which here lists each film's director and lead actors.
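
To go from the relative href values to full links, something along these lines might work. It is only a sketch: it parses a small hand-written HTML fragment (modelled on the lines shown in the question) rather than the live page, and the "http://www.imdb.com" prefix and the names doc, links, hrefs and movies are assumptions introduced for illustration.

```r
library(XML)

# Hand-written fragment imitating the chart table; not the real page.
html <- '<table><tr>
  <td class="posterColumn"><a href="/title/tt0111161/?ref_=chttp_tt_1"><img src="poster.jpg" /></a></td>
  <td class="titleColumn"><a href="/title/tt0111161/?ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a></td>
</tr><tr>
  <td class="titleColumn"><a href="/title/tt0068646/?ref_=chttp_tt_2" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a></td>
</tr></table>'

# asText = TRUE tells htmlParse the argument is HTML content, not a file or URL.
doc <- htmlParse(html, asText = TRUE)

# Same XPath as above: anchors inside the title cells only.
links <- "//td[@class='titleColumn']//a"
titles <- xpathSApply(doc, links, xmlValue)
hrefs <- xpathSApply(doc, links, xmlGetAttr, "href")

# Strip the "?ref_=..." query string and prepend the assumed site prefix.
movies <- data.frame(
  title = titles,
  url = paste0("http://www.imdb.com", sub("\\?.*$", "", hrefs)),
  stringsAsFactors = FALSE
)
print(movies)
```

Using xmlGetAttr for a single attribute keeps the result a plain character vector, which avoids the transpose needed with xmlAttrs.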

OTHER TIPS

What about using the alternative interfaces?

Edit #1: I have looked into some of the files, and they don't seem to contain any links or even the IMDb ID. There should be another way, though.

Edit #2: OK, there is no other way apparently, but somebody already did something. E.g. this guy; have a look.

Licensed under: CC-BY-SA with attribution