Question

For a school project we need to scrape a job-listing website, store the results in a database, and later match those profiles with companies that are looking for people.

On this particular site, all the URLs of the pages I need to scrape sit in a single div with the id 'primaryResults', which holds 10 links per page.

With BeautifulSoup I first want to collect all those links in an array, looping through the page number in the URL until a 404 or something similar pops up.
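Something like this is what I have in mind for that loop (the pg query parameter is only a guess at how the site numbers its pages):

import urllib2

base_url = "http://jobsearch.monsterboard.nl/browse/?pg=%d"
all_links = []
page_number = 1
while True:
    try:
        content = urllib2.urlopen(base_url % page_number).read()
    except urllib2.HTTPError:
        break  # a 404 or similar means there are no more pages
    # ... parse content here and append the 10 links to all_links ...
    page_number += 1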

Then I want to go through each of those pages, store the information I need from each one in an array, and finally send it all to my DB.
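For the database part I was thinking of something along these lines, with sqlite3 as a stand-in for whatever DB we end up using (the table layout and the example row are made up):

import sqlite3

conn = sqlite3.connect("jobs.db")
conn.execute("CREATE TABLE IF NOT EXISTS profiles (title TEXT, url TEXT)")
# profiles would be the array filled while scraping the detail pages
profiles = [("Example job title", "http://example.com/job/1")]
conn.executemany("INSERT INTO profiles (title, url) VALUES (?, ?)", profiles)
conn.commit()
conn.close()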

Now I'm getting stuck at the part where I collect the 10 links from the div with id 'primaryResults'.

How would I write this in Python so that all 10 URLs end up in an array? So far I have tried this:

import urllib2
from BeautifulSoup import BeautifulSoup

opener = urllib2.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

url = ("http://jobsearch.monsterboard.nl/browse/")

content = opener.open(url).read()
soup = BeautifulSoup(content)

soup.find(id="primaryResults")
print soup.find_all('a')

but this only gives an error:

Traceback (most recent call last):

print soup.find_all('a')
TypeError: 'NoneType' object is not callable

Could someone please help me out? Thanks :)


Solution

The TypeError comes from your import: "from BeautifulSoup import BeautifulSoup" loads BeautifulSoup 3, which has no find_all method. Attribute access on a BS3 soup falls back to find(), so soup.find_all is really soup.find('find_all'), which returns None, and calling None raises TypeError: 'NoneType' object is not callable. Switch to bs4 and it works. Here is how to get all the job links from the URL you mentioned:

from bs4 import BeautifulSoup
import urllib2

url = "http://jobsearch.monsterboard.nl/browse/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# every job link on the results page carries the class 'slJobTitle'
jobs = soup.find_all('a', {'class': 'slJobTitle'})
for eachjob in jobs:
    print eachjob['href']
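And since you asked for an array: a list comprehension collects the hrefs instead of printing them. If you would rather scope the search to the primaryResults div from your question, that works too (assuming the anchors sit inside that div):

links = [eachjob['href'] for eachjob in jobs]

# or, limited to the div you mentioned
results = soup.find(id='primaryResults')
if results is not None:
    links = [a['href'] for a in results.find_all('a')]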

Hope it is clear and helpful.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow