I am attempting to extract some columns from http://www.immihelp.com/h1b-sponsoring-companies-database/display-2-2010.html into a CSV file.

from bs4 import  BeautifulSoup
import urllib2
import csv

f = csv.writer(open("H1B_apps.csv", "w"))
f.writerow(["Name", "Jobs", "Positions", "Wage", "City", "State", "Zip"]) # Write column headers as the first line

for x in range (2,5):

    soup = BeautifulSoup(urllib2.urlopen('http://www.immihelp.com/h1b-sponsoring-companies-database/display-'+str(x)+'-2010.html').read())

    table = soup.find('table', cellspacing = '1')

    rows = table.findAll('tr')



    for tr in rows:
        cols = tr.findAll('nobr')
        for data in cols:
            name = cols[0].findAll(text=True)
            jobs = cols[1].findAll(text=True)
            position = cols[2].findAll(text=True)
            wage = cols[3].findAll(text=True)
            city = cols[4].findAll(text=True)
            state = cols[5].findAll(text=True)
            zip = cols[6].findAll(text=True)

            print(name,jobs,position,wage,city,state,zip)
            f.writerow([name,jobs,position,wage,city,state,zip])

The code seems to be working well in general. However, I have the following problems:

  1. The output repeats each row 7 times (something is wrong with my for loop, but I can't figure out what).
  2. Each value is printed as a list, e.g. [u'TEXT']. I just want the text itself.

Here is a sample of the output:

([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER SUPPORT SPECIALISTS'], [u'43139.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'55994.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])
([u'22ND CENTURY TECHNOLOGIES, INC'], [u'1'], [u'COMPUTER PROGRAMMERS'], [u'67995.0/Year'], [u'SOMERSET'], [u'NJ'], [u'08873'])

Any help would be appreciated. Thank you.

Solution

You don't need to loop through data in cols, because you're already accessing the cells directly with [0], [1], [2] and so on. Delete the for data in cols: line (and dedent its body), and the output will stop repeating 7 times.
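
To see why the inner loop multiplies the output, here is a minimal sketch that uses a plain list in place of the parsed cells. The row is copied by hand from the sample output in the question, and the names `buggy` and `fixed` are only illustrative. Since the loop variable `data` is never used, the body simply runs once per column:

```python
# Stand-in for one parsed table row with seven cells, as in the question.
rows = [["22ND CENTURY TECHNOLOGIES, INC", "1", "COMPUTER SUPPORT SPECIALISTS",
         "43139.0/Year", "SOMERSET", "NJ", "08873"]]

buggy = []
for cols in rows:
    for data in cols:          # runs len(cols) == 7 times; `data` is never used
        buggy.append(cols[:])  # so the same full row is emitted 7 times

fixed = []
for cols in rows:
    fixed.append(cols[:])      # index the cells directly: one record per row

print(len(buggy), len(fixed))  # 7 1
```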

Also, findAll always returns a list, so use name = cols[0].findAll(text=True)[0] to get the text itself rather than a one-element list.

However, some lines have empty fields. If you try and get an empty field with findAll, it returns an empty list [], not [''], so you can't access it with [0].
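
The difference between a filled and an empty field can be reproduced with plain lists standing in for what findAll(text=True) returns. The two lists below are hand-written stand-ins, not real parser output, and `caught` is just a flag for the demonstration:

```python
filled = [u'SOMERSET']   # what a non-empty cell yields, per the sample output
empty = []               # what an empty cell yields

print(filled[0])         # the bare string, without the list wrapper

caught = False
try:
    empty[0]             # there is no element 0 in an empty list
except IndexError:
    caught = True        # this is the error the scraper would hit

print(caught)
```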

Since fetching a field, checking whether it's empty, and returning the result is something you do many times, a simple way to handle it is a small helper function:

def getcol(cols, index, default=None):
    try:
        return cols[index].findAll(text=True)[0]
    except IndexError:
        return default

which you can then use in the for loop with name = getcol(cols, 0), for instance.

Some rows also come back with no cells at all, so cols is empty. Take that into account too, for example by skipping any row where cols is empty before writing it out.
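
Putting the pieces together, the loop might look roughly like the sketch below. It assumes Python 3 and does not touch the network: the `Cell` class is a hypothetical stand-in that only mimics findAll(text=True) on a parsed nobr tag, the `rows` data is made up from the sample output, and the CSV is written to an in-memory buffer instead of H1B_apps.csv.

```python
import csv
import io


class Cell(object):
    """Hypothetical stand-in for a parsed <nobr> tag; only mimics findAll(text=True)."""

    def __init__(self, text):
        self._text = text

    def findAll(self, text=True):
        # Like the real call, return a list: empty when the cell has no text.
        return [self._text] if self._text else []


def getcol(cols, index, default=None):
    try:
        return cols[index].findAll(text=True)[0]
    except IndexError:
        return default


# Made-up rows: one full, one with no cells at all, one with an empty wage field.
rows = [
    [Cell(t) for t in ["22ND CENTURY TECHNOLOGIES, INC", "1", "COMPUTER PROGRAMMERS",
                       "55994.0/Year", "SOMERSET", "NJ", "08873"]],
    [],
    [Cell(t) for t in ["22ND CENTURY TECHNOLOGIES, INC", "1", "COMPUTER PROGRAMMERS",
                       "", "SOMERSET", "NJ", "08873"]],
]

out = io.StringIO()          # in-memory file so the sketch has no side effects
writer = csv.writer(out)
writer.writerow(["Name", "Jobs", "Positions", "Wage", "City", "State", "Zip"])

written = 0
for cols in rows:
    if not cols:             # row produced no cells: skip it
        continue
    # One record per row; empty fields fall back to "" instead of raising.
    writer.writerow([getcol(cols, i, default="") for i in range(7)])
    written += 1

print(out.getvalue())
```

With the real pages, rows would come from table.findAll('tr') and cols from tr.findAll('nobr'), as in the question.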

Just so you know, the source of those immihelp pages has this copyright notice:

immihelp.com reserves all of our rights, including but not limited to any and all copyrights, trademarks, patents, trade secrets, and any other proprietary right that we may have in our web site, its content, and the goods and services that may be provided. The use of our rights and property requires our prior written consent. We are not providing you with any implied or express licenses or rights by making services available to you and you will have no rights to make any commercial uses of our web site or service without our prior written consent.

Contents of this webpage can't be seen as they are not meant to be viewed or copied.

Any violator will be prosecuted to the full extent of law and may face civil and criminal charges and huge monetary fines. You are warned! Beware!

They're a bit silly to claim that the 'contents of this webpage can't be seen', since they quite plainly can be (your web browser couldn't display the page otherwise). But they have gone out of their way to make it a bit harder, so using their data without consent is probably something they could sue over.

Whether or not it's illegal is up to how much you pay the lawyers, as usual.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow