Question

I am trying to scrape all of the careers pages from this website: http://wearemadeinny.com/find-a-job/

I tried the below, but unfortunately the hrefs only appear when you click into one of the company pages:

from lxml import html
import requests

page = requests.get("http://wearemadeinny.com/find-a-job/")
tree = lxml.html.fromstring(page.text)

jobs = tree.xpath('//*[@id="venue-hiring"]/a/@href')

links = [x for x in jobs]

print links

I noticed that each <li> contains HTML data attributes which hold the job page URLs. So, is it possible to scrape the data-hiringurl attribute from each <li>? If not with lxml and XPath selectors, are there other options?

This is one of the <li> elements that I would like to pull from. Specifically, I would like to pull data-hiringurl="http://www.admeld.com/about/jobs/". The XPath to this element is //*[@id="v7"]

<li id="v7" data-vid="7" data-name="Admeld" data-address="230 Park Avenue South Suite 1201" data-lat="40.7378349" data-long="-73.9886703" data-url="http://www.admeld.com/" data-hiring="1" data-hiringurl="http://www.admeld.com/about/jobs/" data-whynyc="" data-category="1"><a href="#" class="list-digital">
<span class="venue-name">Admeld</span><br>
<span class="venue-address">230 Park Avenue South</span>
<br><span class="venue-hiring">We are hiring!</span>                                    
</a>
</li>

Solution

Searching for expected content by means of lxml

This assumes you already have the content of the page containing the data you need. The code below fetches it with an HTTP request; if the page has to be rendered in a browser first, see the later part of my answer for how to get it.

If you want all values of the data-hiringurl attribute, try the XPath //@data-hiringurl:

from lxml import html
import requests

url = "http://wearemadeinny.com/find-a-job/"

page = requests.get(url)
tree = html.fromstring(page.text) # corrected, used to be `lxml.html.fromstring`

xp = "//@data-hiringurl"
job_urls = tree.xpath(xp)

print(job_urls)

However, I am not sure that the page at the URL you provided contains such data; I did not find it there.
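To show the XPath idea without depending on what the live page returns, here is a minimal sketch that runs against markup pasted inline. The first <li> is shortened from the question; the second <li> ("NoJobs Inc") and the helper name hiring_urls are made up for illustration. It is written for Python 3 and selects only <li> elements that carry the attribute, pairing each company name with its hiring URL.

```python
from lxml import html

# First <li> is shortened from the question; the second one is a
# hypothetical entry with an empty hiring URL, added for illustration.
SAMPLE = """
<ul>
<li id="v7" data-vid="7" data-name="Admeld" data-url="http://www.admeld.com/" data-hiring="1" data-hiringurl="http://www.admeld.com/about/jobs/" data-whynyc="" data-category="1"><a href="#" class="list-digital">
<span class="venue-name">Admeld</span><br>
<span class="venue-address">230 Park Avenue South</span>
<br><span class="venue-hiring">We are hiring!</span>
</a>
</li>
<li id="v8" data-name="NoJobs Inc" data-hiring="0" data-hiringurl=""></li>
</ul>
"""


def hiring_urls(markup):
    """Return (company name, hiring URL) pairs found in the markup."""
    tree = html.fromstring(markup)
    pairs = []
    # Select only <li> elements that have a data-hiringurl attribute.
    for li in tree.xpath("//li[@data-hiringurl]"):
        url = li.get("data-hiringurl")
        if url:  # skip entries whose attribute is empty
            pairs.append((li.get("data-name"), url))
    return pairs


print(hiring_urls(SAMPLE))
```

If the real page delivers the same <li> structure, passing page.text to the same helper should work, but I have not verified that against the site.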

Getting content of page rendered by JavaScript

If the content you are interested in is rendered dynamically on the client, you need to provide a browser context and let the page render there. Selenium can do this:

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> url = "http://wearemadeinny.com/find-a-job/"
>>> browser.get(url)
>>> page = browser.page_source
>>> print(page)

Now the page variable holds the content of the page, and you can proceed with lxml as described above.
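Putting the two steps together might look like the sketch below (Python 3). The function names fetch_rendered_source and extract_hiring_urls are my own, not part of selenium or lxml, and I am assuming the attribute is present once the page has rendered, which I have not confirmed. The browser is closed in a finally block so a failed page load does not leave a Firefox window open.

```python
from lxml import html


def fetch_rendered_source(url):
    """Load the URL in Firefox through selenium and return the rendered HTML."""
    # Imported here so the parsing helper below also works without selenium.
    from selenium import webdriver

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even on errors


def extract_hiring_urls(page_source):
    """Return all non-empty data-hiringurl values found in the HTML string."""
    tree = html.fromstring(page_source)
    return [url for url in tree.xpath("//@data-hiringurl") if url]


# Usage (opens a real Firefox window, so it is left commented out):
# source = fetch_rendered_source("http://wearemadeinny.com/find-a-job/")
# print(extract_hiring_urls(source))
```

Keeping the fetching and the parsing separate means the XPath part can be tried on any saved HTML string before involving a browser at all.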

Note: I cannot guarantee that the page will contain the expected content; I only know that it arrives in rendered form. If you need to go further by clicking on elements, filling in text, or pressing buttons, all of that can be done with the browser instance shown above; see the selenium documentation.

Licensed under: CC-BY-SA with attribution