Scraping pages that do not seem to have URLs

https://stackoverflow.com/questions/19068521

29-06-2022
|

Question

I'm trying to scrape these listings and provide more exposure for these job listings on a site that belongs to a client of mine. The issue is that I need to be able to link to the specific job listing in order for the job seeker to apply. This is the page I'm trying to save listing links from.

It would be ideal if I could save an address for the job seeker to click on to see the original listing and then apply.

What is this website doing to not feature a URL for these pages
Is it possible to provide a listing specific address
If that's possible how could I generate that address?

If I can't get a specific address I think I could get it so that the user clicks a link that triggers an internal script on my client's site which takes the listing ID and searches the site I found that listing on, and then redirects the user to that specific listing.

The downside to this is that the user will have to wait a little while depending on how far back the listing is on a directory. I could put some kind of progress bar with a pleasant "Searching for your listing! Thanks for being patient" message.

If I can avoid having to do this, though, that'd be great!

I'm using Nokogiri and Mechanize.

Solution

The page you refer to appears to be generated by an Oracle product, so one would think they'd be willing to construct a web form properly (and with reference to accessibility concerns). They haven't, so it occurs to me that either their engineer was having a bad day, or they are deliberately making it (slightly) harder to scrape.

The reason your browser shows no href when you hover over those links is that there isn't one. What the page does instead is to use JavaScript to capture the click event, populate a POST form with some hidden values, and call the submit method programmatically. This can cause problems with screen-readers and other accessibility devices, as well as causing problems with the way in which back buttons have to re-submit the page.

The good news is that constructions of this kind can usually be scraped by creating a form yourself, either using a real one on a third party page, or via a crawler library. If you post the right values to the target URI, reverse-engineered from examining the page's script, the resulting document should be the "linked" page you expect.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow