Python extract search results from monster.com

https://stackoverflow.com/questions/9116778

21-04-2021

Question

I've seen examples of extracting Google search results, but that approach doesn't work here. I would like to be able to edit the parameters in the code and, when it runs, have it perform the search and scrape the job titles, locations, and dates. This is what I have so far. Any help would be great, and thanks in advance.

I would like the script to execute a search on monster.com with the given params (engineer software CA) and scrape the results.

#! /usr/bin/python
import re
import requests
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

parameters = ["Software","Engineer","CA"]
base_url = "http://careers.boozallen.com/search?q="
search_string = "+".join(parameters)

final_url = base_url + search_string

a = requests.get(final_url)
raw_string = a.text.strip()


soup = BeautifulSoup( raw_string )

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' })

for job_url in job_urls:

    print job_url.text
    print

raw_input("Press enter to close: ")

I know that the code below works as a standard scrape:

handle = urlopen("http://jobsearch.monster.com/search/Engineer_5?q=Software&where=AZ&rad=20&sort=rv.di.dt")
response = handle.read()
soup = BeautifulSoup( response )

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' })
for job_url in job_urls:
    print job_url.text
    print

No correct solution

OTHER TIPS

If you point your browser at http://careers.boozallen.com/search?q=software+engineer+CA and inspect the page source, you'll see HTML like this:

<tr class="dbOutputRow2">
    <td style="width: 400px;" class="colTitle" headers="hdrTitle"><span class="jobTitle"><a href="http://careers.boozallen.com/job/San-Diego-Network-Engineer%2C-Senior-Job-CA-92101/1645793/">Network Engineer, Senior Job</a></span></td>
    <td style="width: auto;" class="colLocation" headers="hdrLocation"><span class="jobLocation">San Diego, CA, US</span></td>
    <td style="width: 155px;" class="colDate" headers="hdrDate" nowrap="nowrap"><span class="jobDate">Jan 5, 2012</span></td>

The information you are looking for is in <span> tags whose class attribute is jobTitle, jobLocation, or jobDate.
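To make the span/class idea concrete without any third-party library, here is a minimal sketch using only the standard library's html.parser. It assumes Python 3 (the code elsewhere on this page is Python 2), and it assumes the wanted spans are never nested inside one another. The names SAMPLE_ROW, JobSpanParser, and extract_job_fields are hypothetical, and the sample row is a trimmed copy of the markup shown above with a placeholder href.

```python
from html.parser import HTMLParser

# Trimmed copy of the row shown above; the href is a placeholder.
SAMPLE_ROW = """
<tr class="dbOutputRow2">
    <td class="colTitle"><span class="jobTitle"><a href="#">Network Engineer, Senior Job</a></span></td>
    <td class="colLocation"><span class="jobLocation">San Diego, CA, US</span></td>
    <td class="colDate"><span class="jobDate">Jan 5, 2012</span></td>
</tr>
"""


class JobSpanParser(HTMLParser):
    """Collect the text of <span> elements whose class is one we care about."""

    WANTED = ("jobTitle", "jobLocation", "jobDate")

    def __init__(self):
        super().__init__()
        self.fields = []      # (class name, text) pairs in document order
        self._current = None  # class of the wanted span we are inside, if any
        self._chunks = []     # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and self._current is None:
            cls = dict(attrs).get("class")
            if cls in self.WANTED:
                self._current = cls
                self._chunks = []

    def handle_data(self, data):
        # Text inside child tags (such as the <a> in jobTitle) also lands here.
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested <span> inside a wanted span.
        if tag == "span" and self._current is not None:
            self.fields.append((self._current, "".join(self._chunks).strip()))
            self._current = None


def extract_job_fields(html):
    """Return (class name, text) pairs for the job spans found in html."""
    parser = JobSpanParser()
    parser.feed(html)
    parser.close()
    return parser.fields


for class_name, text in extract_job_fields(SAMPLE_ROW):
    print(class_name, "->", text)
```

An exact match on the class attribute is enough for this markup; if the site used several classes per span, the comparison would need to split the attribute on whitespace first.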

Here is how you could scrape these bits using lxml:

import urllib2
import lxml.html as LH

url = 'http://careers.boozallen.com/search?q=software+engineer+CA'
doc = LH.parse(urllib2.urlopen(url))

def text_content(iterable):
    for elt in iterable:
        yield elt.text_content()

data = text_content(doc.xpath('''//span[@class = "jobTitle"
                                        or @class = "jobLocation"
                                        or @class = "jobDate"]'''))

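# data is a single generator, so zip(*[data]*3) draws three consecutive
# items per tuple: (title, location, date).
for title, location, date in zip(*[data]*3):
    print(title, location, date)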

yields

('Title', 'Location', 'Date')
('Network Engineer, Senior Job', 'San Diego, CA, US', 'Jan 5, 2012')
('Network Integration Engineer, Mid Job', 'San Diego, CA, US', 'Jan 12, 2012')
('Systems Engineer, Senior Job', 'San Diego, CA, US', 'Jan 31, 2012')
('Enterprise Architect, Senior Job', 'Washington, DC, US', 'Jan 23, 2012')
...
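Two details of the lxml code may be worth unpacking: the search URL is hard-coded, and zip(*[data]*3) relies on data being a one-pass iterator. The sketch below, written for Python 3 as an assumption, shows one way to build the URL from a list of terms and to do the same grouping with a small helper. The names build_search_url and grouped are hypothetical; the base URL is the one from the question, and whether the site still accepts this query format is not something this sketch checks.

```python
from urllib.parse import urlencode

# Base URL taken from the question; treat it as an assumption about the site.
BASE_URL = "http://careers.boozallen.com/search"


def build_search_url(terms, base_url=BASE_URL):
    """Join the terms into one q= parameter; urlencode writes spaces as '+'."""
    return base_url + "?" + urlencode({"q": " ".join(terms)})


def grouped(iterable, n):
    """Return an iterator of consecutive n-tuples taken from iterable.

    The same iterator object is passed to zip n times, so each tuple
    consumes the next n items. Leftover items that do not fill a whole
    tuple are dropped.
    """
    it = iter(iterable)  # iter() makes this safe for lists as well as generators
    return zip(*[it] * n)


print(build_search_url(["software", "engineer", "CA"]))

flat = ["Title", "Location", "Date",
        "Network Engineer, Senior Job", "San Diego, CA, US", "Jan 5, 2012"]
for title, location, date in grouped(flat, 3):
    print(title, "|", location, "|", date)
```

Calling iter() first matters: passing a plain list to zip(*[data]*3) would pair each element with itself rather than group consecutive items, whereas the generator in the code above is already a single shared iterator.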

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow