Question

I am hoping to extract the change in cost of living between one city and many other cities. I plan to list the cities I would like to compare in a CSV file and use this list to build the web link that takes me to the page with the information I am looking for.

Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

Unfortunately, I am running into several challenges. Any assistance with the following challenges is greatly appreciated!

  1. The output shows only the percentage, with no indication of whether the city is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' that adds a '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
  2. When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write it as a list on one line.
  3. (related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?

CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%

PREFERRED CSV FORMAT:

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city, 48,129,63,43,42,42

Here is my current code:

import requests
import csv
from bs4 import BeautifulSoup

#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'

i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page  = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)

    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])

    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")

    for expatistan_title in expatistan_titles:
            percent_difference = expatistan_title.find("th",class_="percent")
            percent_difference_title = percent_difference.span['class']
            if percent_difference_title == "expensiver":
                WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
            else:
                WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1

Solution

Answers:

  • Question 1: the class of the span is a list, so you need to check whether "expensiver" is inside that list. In other words, replace:

    if percent_difference_title == "expensiver":

    with:

    if "expensiver" in percent_difference.span['class']:
    
  • Questions 2 and 3: you need to pass a list of column values to writerow(), not a string. And, since you want only one record per city, call writerow() outside of the loop over the trs.
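
To see why the original comparison never matched, here is a minimal sketch that needs no network access. It assumes, as described above, that the class attribute comes back as a list of strings. The helper name signed_change and the class name "cheaper" are hypothetical and used only for illustration.

```python
def signed_change(span_classes, text):
    """Prefix text with '+' when the span is marked expensiver, else '-'.

    span_classes stands in for percent_difference.span['class'],
    which is assumed here to be a list of class names, not a string.
    """
    sign = '+' if "expensiver" in span_classes else '-'
    return sign + text


classes = ["expensiver"]             # shape of span['class'] (assumed)
print(classes == "expensiver")       # False: a list never equals a string
print("expensiver" in classes)       # True: the membership test matches
print(signed_change(classes, "48%"))
print(signed_change(["cheaper"], "12%"))  # "cheaper" is a made-up class name
```

Because the equality test is always False for a list, the original code always fell through to the else branch and wrote a '-' sign.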

Other issues:

  • open the CSV file for writing once, before the loop
  • use with context managers when working with files
  • try to follow the PEP 8 style guide

Here's the code with modifications:

import requests
import csv
from bs4 import BeautifulSoup

BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'

with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            # Name the parser explicitly so the result does not depend on what is installed.
            soup = BeautifulSoup(requests.get(url).text, "html.parser")

            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)

For the City.txt containing just one new-york-city line, it produces Expatistan.csv with the following content:

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%

Make sure you understand what changes I have made. Let me know if you need further help.
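
The preferred format in the question also drops the percent sign, which the output above keeps. One possible way to handle that, sketched under the assumption that every value looks like '+48%' or '-12%', is to convert each string to a signed integer before writing the row. The helper name to_number is hypothetical.

```python
def to_number(change):
    """Turn a string such as '+48%' or '-12%' into a signed int."""
    return int(change.strip().rstrip('%'))


differences = ['+48%', '+129%', '+63%', '+43%', '+42%', '+42%']
row = ['new-york-city'] + [to_number(d) for d in differences]
print(row)
```

With this conversion a more expensive category becomes a plain positive number and a cheaper one a negative number, so the sign information is still preserved in the CSV.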

OTHER TIPS

csv.writer.writerow() takes a sequence and makes each element a column. Normally you'd give it a list of column values, but you are passing in a string, so each individual character becomes its own column.

Just build a list, then write it to the CSV file.

Also, open the CSV file once, not once per city; opening it in write mode clears out the file every time.

import requests
import csv
from bs4 import BeautifulSoup

HomeCity = 'Phoenix'

# Python 3: open the CSV in text mode with newline=""; on Python 2 use "wb" instead.
with open("City.txt") as cities, open("Expatistan.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["City", "Food", "Housing", "Clothes",
                     "Transportation", "Personal Care", "Entertainment"])

    for line in cities:
        city = line.strip()
        url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
            HomeCity, city)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)

        titles = soup.select("table.comparison tr.expandable")

        row = [city]
        for title in titles:
            percent_difference = title.find("th", class_="percent")
            changeclass = percent_difference.span['class']
            change = percent_difference.span.string
            if "expensiver" in changeclass:
                change = '+' + change
            else:
                change = '-' + change
            row.append(change)
        writer.writerow(row)
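
The point about the file being cleared on every open can be checked in isolation. The following is a small sketch written for Python 3 that uses a throwaway temporary file; the city names are example values only.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")  # throwaway file
cities = ["new-york-city", "chicago"]                # example names only

# Reopening in "w" mode inside the loop truncates the file each time.
for city in cities:
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow([city])
with open(path, newline="") as f:
    reopened = list(csv.reader(f))

# Opening once and writing inside the loop keeps every row.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for city in cities:
        writer.writerow([city])
with open(path, newline="") as f:
    opened_once = list(csv.reader(f))

print(reopened)     # only the last city survives
print(opened_once)  # both cities are present
```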

First of all, the writerow method takes an iterable, and each item in that iterable is written as one field, with commas separating the fields. So if you give it a string, each character becomes its own field:

WriteResultsFile.writerow('hello there')

writes

h,e,l,l,o, ,t,h,e,r,e

But

WriteResultsFile.writerow(['hello', 'there'])

writes

hello,there

That's why you are getting results like

n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%

The rest of your problems are errors in your webscraping. First of all, when I scrape the site, searching for tables with CSS class "comparison" gives me None. So I had to use

expatistan_table = soup_expatistan.find("table","comparison")

Now, the reason your "if statement is broken" is that

percent_difference.span['class']

returns a list. If we modify that to

percent_difference.span['class'][0]

things will work the way you expect.

Now, your real issue is that inside the innermost loop you are finding the % change in price for the individual categories. You want these as items in a single row of price differences, not as individual rows. So I declare an empty list, items, to which I append percent_difference.span.string, and then write the row outside the innermost loop, like so:

items = []
for expatistan_title in expatistan_titles:
    percent_difference = expatistan_title.find("th","percent")
    percent_difference_title = percent_difference.span["class"][0]
    print(percent_difference_title)
    if percent_difference_title == "expensiver":
        items.append('+' + percent_difference.span.string)
    else:
        items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)

The final error is that inside the while loop you re-open the CSV file and overwrite everything, so you only have the final city in the end. Accounting for all these errors (many of which you should have been able to find without help) leaves us with:

#Prepare CSV writer once, before the loop, and write the header a single time.
WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])

i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page  = requests.get(url).text
    print(url)
    soup_expatistan = BeautifulSoup(page)

    expatistan_table = soup_expatistan.find("table","comparison")
    expatistan_titles = expatistan_table.find_all("tr","expandable")

    #Collect one signed percentage per category, then write one row per city.
    items = []
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th","percent")
        percent_difference_title = percent_difference.span["class"][0]
        print(percent_difference_title)
        if percent_difference_title == "expensiver":
            items.append('+' + percent_difference.span.string)
        else:
            items.append('-' + percent_difference.span.string)
    row = [Textfilelistsplit[i]]
    row.extend(items)
    WriteResultsFile.writerow(row)
    i+=1

YAA - Yet Another Answer.

Unlike the other answers, this treats the data as a series of key-value pairs, i.e. a list of dictionaries, which are then written to CSV. A list of wanted fields is provided to the csv writer (DictWriter), which discards additional information (beyond the specified fields) and blanks missing information. Also, should the order of the information on the original page change, this solution is unaffected.

I also assume you are going to open the CSV file in something like Excel. Additional parameters need to be given to the csv writer for this to happen nicely (see dialect parameter). Given that we are not sanitising the returned data, we should explicitly delimit it with unconditional quoting (see quoting parameter).

import csv
import requests

from bs4 import BeautifulSoup

#Read text file
with open("City.txt") as cities_h:
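    # Strip trailing newlines (and skip blank lines) so they do not end up in the URL.
    cities = [line.strip() for line in cities_h if line.strip()]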
home_city = "Phoenix"

city_data = []
for city in cities:
    url    = "http://www.expatistan.com/cost-of-living/comparison/%s/%s" % (home_city, city)
    resp   = requests.get(url)
    soup   = BeautifulSoup(resp.content, from_encoding = resp.encoding)
    titles = soup.select("table.comparison tr.expandable")
    if titles:
        data = {}
        for title in titles:
            name = title.find("th", class_ = "clickable")
            diff = title.find("th", class_ = "percent")
            exp  = bool(diff.find("span", class_ = "expensiver"))
            data[name.text] = ("+" if exp else "-") + diff.span.text
        data["City"] = soup.find("strong", class_ = "city-2").text
        city_data.append(data)

with open("Expatistan.csv","w") as csv_h:
    fields = \
    [
        "City",
        "Food",
        "Housing",
        "Clothes",
        "Transportation",
        "Personal Care",
        "Entertainment"
    ]

    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting        = csv.QUOTE_ALL,
        extrasaction   = "ignore",
        dialect        = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(city_data)
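
The DictWriter behaviour this answer relies on (extra keys ignored, missing keys left blank, every field quoted) can be seen without any scraping. This is a minimal sketch for Python 3 that writes to an in-memory buffer; the rows and the "Unlisted" key are made-up sample data.

```python
import csv
import io

fields = ["City", "Food", "Housing"]
rows = [
    # Keys in a different order, plus one key that is not in fields.
    {"City": "new-york-city", "Housing": "+129%", "Food": "+48%", "Unlisted": "x"},
    # "Housing" is missing here and is written as an empty field.
    {"City": "somewhere", "Food": "-5%"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fields, quoting=csv.QUOTE_ALL,
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The columns follow the order of fields rather than the order of the keys in each dictionary, which is why a change in the page's ordering would not affect the output.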
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow