Web scraping with BeautifulSoup returns different content

https://stackoverflow.com/questions/22440820

15-06-2023

Question

If you visit http://www.imdb.com/title/tt2375692/episodes?season=1, you will see that the air date of season 1, episode 1 is 25 Jan. 2014.

This is the code I am using to scrape.

    req = urllib2.Request('http://www.imdb.com/title/tt2375692/episodes?season=1')
    self.diziPage = urllib2.urlopen(req).read()
    self.diziSoup = BeautifulSoup(self.diziPage,from_encoding="utf8")

After I scrape the site, everything is fine except the air date: episode 1's air date comes out as 20 April 2014, which is not what the page shows when I visit it in a browser. All the rest of the information comes out correct.

I thought it might be because of the request headers, so I did some experiments, but that didn't work.

Solution 2

It seems that IMDb serves different air dates depending on the visitor's location. That is why I am getting different data; I think they check the visitor's IP address or something similar.
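If the response really does vary per visitor, one cheap check is to pin the locale-related request headers and compare the result with what the browser shows. The following is a minimal sketch in Python 3 (where `urllib2` became `urllib.request`); the helper name `build_request` and the header values are illustrative assumptions, not something from the original post. Whether IMDb uses `Accept-Language` for air dates is not confirmed here, and if the site keys on the IP address, as suspected above, changing headers will not help, which would match the failed header experiments in the question.

```python
import urllib.request


def build_request(url, accept_language="en-US,en;q=0.9", user_agent="Mozilla/5.0"):
    """Build a request with explicit locale-related headers (hypothetical helper).

    Nothing is sent over the network here; the caller decides when to open it.
    """
    headers = {
        "Accept-Language": accept_language,  # assumed value, adjust as needed
        "User-Agent": user_agent,            # assumed value, adjust as needed
    }
    return urllib.request.Request(url, headers=headers)


req = build_request("http://www.imdb.com/title/tt2375692/episodes?season=1")

# To actually fetch the page with these headers:
# page = urllib.request.urlopen(req).read()
```

Fetching the page twice with different `accept_language` values and comparing the extracted air dates would show whether the headers have any effect.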

Other suggestions

I get 25 Jan. 2014 when I scrape the date using BeautifulSoup. First, find the link to the first episode (titled "I."), then get the episode block by taking the parent of the link's parent, then find the date by its class inside that block:

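    import urllib2
    from bs4 import BeautifulSoup

    url = "http://www.imdb.com/title/tt2375692/episodes?season=1"

    # Fetch the page and parse it (Python 2: urllib2 and the print statement)
    soup = BeautifulSoup(urllib2.urlopen(url))

    # The episode link has the title "I."; its grandparent is the episode block
    episode1 = soup.find('a', {'title': 'I.'}).parent.parent
    # The air date is in a div with class "airdate" inside that block
    print episode1.find('div', {'class': 'airdate'}).text.strip()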

prints:

25 Jan. 2014
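The same navigation can be tried offline, which helps separate parsing problems from differences in what the server sends. This is a minimal sketch in Python 3 with `bs4`; the HTML snippet is hypothetical markup invented for the example (the real IMDb page differs), and `airdate_for` is an illustrative helper name, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page structure differs.
html = """
<div class="list_item">
  <div class="info">
    <div class="airdate">
      25 Jan. 2014
    </div>
    <strong><a href="#" title="I.">I.</a></strong>
  </div>
</div>
"""


def airdate_for(soup, title):
    """Return the air date text for the episode link with the given title, or None."""
    link = soup.find("a", {"title": title})
    if link is None:
        return None
    # Link -> <strong> -> enclosing block, mirroring .parent.parent above
    block = link.parent.parent
    date = block.find("div", {"class": "airdate"})
    return date.text.strip() if date is not None else None


soup = BeautifulSoup(html, "html.parser")
print(airdate_for(soup, "I."))
```

Returning `None` when the link or the date is missing avoids the `AttributeError` that a chained `.parent.parent` call raises if the page layout changes.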

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow