Python - Web Scraping - BeautifulSoup

https://stackoverflow.com/questions/23446975

14-07-2023
Question

I am new to BeautifulSoup and trying to extract data from the following website: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

I am trying to extract the summary percentages for each of the categories (food, housing, clothes, transportation, personal care, and entertainment). For the link provided above, I would like to extract the percentages 48%, 129%, 63%, 43%, 42%, 42%, and 72%.

Unfortunately, my current Python code using BeautifulSoup extracts the following percentages instead: 12%, 85%, 63%, 21%, 42%, and 48%. I do not know why this is the case. Any help would be greatly appreciated! Here is my code:

import urllib2
from bs4 import BeautifulSoup
url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
page  = urllib2.urlopen(url)
soup_expatistan = BeautifulSoup(page)
page.close()

expatistan_table = soup_expatistan.find("table",class_="comparison")
expatistan_titles = expatistan_table.find_all("tr",class_="expandable")

for expatistan_title in expatistan_titles:
    published_date = expatistan_title.find("th",class_="percent")
    print(published_date.span.string)

Solution

I couldn't identify the exact cause, but the problem seems to be related to urllib2. After simply switching to requests, the code started to work. Here is the code:

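import requests
from bs4 import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
# Fetch the page with requests instead of urllib2
page = requests.get(url).text
# Name the parser explicitly so the result does not depend on what is installed
soup_expatistan = BeautifulSoup(page, "html.parser")

# Each category summary is an expandable row of the comparison table
expatistan_table = soup_expatistan.find("table", class_="comparison")
expatistan_titles = expatistan_table.find_all("tr", class_="expandable")

for expatistan_title in expatistan_titles:
    # The percentage is the text of the span inside the "percent" header cell
    percent_cell = expatistan_title.find("th", class_="percent")
    print(percent_cell.span.string)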

You can install requests with pip:

$ pip install requests
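
The parsing loop above assumes every lookup succeeds. If the server returns different markup, find returns None and the attribute access raises an error. Below is a minimal sketch of a more defensive version, run against a small inline HTML sample rather than the live site. The sample markup is an assumption inferred from the selectors used above, not a copy of the real page, and extract_percentages is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, inferred from the selectors used in the answer.
SAMPLE_HTML = """
<table class="comparison">
  <tr class="expandable"><th class="percent"><span>48%</span></th></tr>
  <tr class="expandable"><th class="percent"><span>129%</span></th></tr>
  <tr class="expandable"><th>row without a percent cell</th></tr>
</table>
"""


def extract_percentages(html):
    """Return the text of each percent cell, skipping rows that lack one."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="comparison")
    if table is None:
        # The page did not contain the expected table at all.
        return []
    results = []
    for row in table.find_all("tr", class_="expandable"):
        cell = row.find("th", class_="percent")
        if cell is not None and cell.span is not None:
            results.append(cell.span.get_text(strip=True))
    return results


print(extract_percentages(SAMPLE_HTML))
```

Returning an empty list when the table is missing makes it easier to notice that the server sent an unexpected page, instead of failing with an AttributeError on None.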

EDIT

The problem is indeed related to urllib2. It seems that the www.expatistan.com server responds differently depending on the User-Agent header sent with the request. To get the same response with urllib2, you have to set a browser-like User-Agent explicitly:

import urllib2

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
request = urllib2.Request(url)
# Send a browser-like User-Agent instead of urllib2's default one
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0')
opener = urllib2.build_opener()
page = opener.open(request).read()
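
urllib2 exists only in Python 2. In Python 3 its functionality lives in urllib.request, so the same idea might look like the sketch below. It prints the default User-Agent that the standard opener sends, then builds a request with the browser-like header from the snippet above. The names build_request and fetch are hypothetical helpers, and fetch is defined but not called here, so nothing is downloaded when the block runs.

```python
import urllib.request

URL = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) "
              "Gecko/20130406 Firefox/23.0")


def build_request(url, user_agent=USER_AGENT):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# Without an explicit header, the opener identifies itself as Python-urllib/<version>.
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_agent)

# With the header set, the request carries the browser-like value instead.
request = build_request(URL)
print(request.get_header("User-agent"))
```

Note that urllib stores header names capitalized as "User-agent", which is why that spelling is used when reading the value back. The bytes returned by fetch(URL) can be passed to BeautifulSoup as before.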

Licensed under: CC-BY-SA with attribution
