A simple spider question

https://stackoverflow.com/questions/1810652

05-07-2019

Question

I am a newbie trying to achieve this simple task using Scrapy, with no luck so far. I am asking for your advice on how to do this with Scrapy or with any other Python tool. Thank you.

I want to

start from a page that lists the bios of attorneys whose last names start with A: initial_url = www.example.com/Attorneys/List.aspx?LastName=A
extract the links to the actual bios (/BioLinks/) from the LastName=A page
visit each of the /BioLinks/ pages to extract the school info for each attorney.

I am able to extract the /BioLinks/ and School information but I am unable to go from the initial url to the bio pages.

If you think this is the wrong way to go about it, how would you achieve this goal?

Many thanks.

Solution

Not sure I fully understand what you're asking, but maybe you need to get the absolute URL to each bio and retrieve the source code for that page:

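# Python 3: urlopen lives in urllib.request (it was urllib2.urlopen in Python 2)
from urllib.request import urlopen
bio_page = urlopen(bio_url).read()  # bio_url: absolute URL of one bio page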
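To get the absolute URL for each bio, the relative /BioLinks/ hrefs found on the list page can be joined with the URL of the list page itself. Here is a minimal sketch in Python 3 using only the standard library. The class BioLinkParser, the function extract_bio_urls, the sample HTML, and the file name adams.aspx are all made up for illustration; the real page markup may well differ, so treat the "/BioLinks/" substring check as an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BioLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose href contains a marker string."""

    def __init__(self, marker="/BioLinks/"):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.links.append(href)


def extract_bio_urls(list_page_html, base_url, marker="/BioLinks/"):
    """Return absolute URLs for all bio links found in the list page HTML."""
    parser = BioLinkParser(marker)
    parser.feed(list_page_html)
    # urljoin resolves a relative href against the URL of the page it came from
    return [urljoin(base_url, href) for href in parser.links]


# Hypothetical list page content, standing in for the downloaded HTML
sample_html = (
    '<a href="/BioLinks/adams.aspx">Adams</a> '
    '<a href="/About.aspx">About</a>'
)
base = "http://www.example.com/Attorneys/List.aspx?LastName=A"
bio_urls = extract_bio_urls(sample_html, base)
```

Each entry in bio_urls can then be passed to urlopen as bio_url, one page at a time.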

Then use a regular expression or another parsing method to get the attorney's law school.
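As a rough illustration of the regular expression route, here is a small Python sketch. The markup it looks for (a span with class "school") is purely an assumption, since the question does not show the bio page's HTML; the pattern would need to be adapted to whatever the real pages contain. The names SCHOOL_PATTERN and extract_school are likewise hypothetical.

```python
import re

# Assumed markup: <span class="school">...</span>. Adjust to the real bio page.
SCHOOL_PATTERN = re.compile(
    r'<span class="school">\s*(.*?)\s*</span>',
    re.IGNORECASE | re.DOTALL,
)


def extract_school(bio_html):
    """Return the school text from a bio page, or None if the pattern is absent."""
    match = SCHOOL_PATTERN.search(bio_html)
    return match.group(1) if match else None


# Hypothetical bio page content, standing in for the downloaded HTML
sample_bio = '<h1>Jane Adams</h1><span class="school"> Example Law School </span>'
school = extract_school(sample_bio)
```

If urlopen(...).read() is used to fetch the page, the result is bytes and would need to be decoded (for example with .decode("utf-8"), assuming that encoding) before being passed to extract_school. Regular expressions tend to be brittle against HTML changes, so a real parser may be the safer choice once the page structure is known.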

Licensed under: CC-BY-SA with attribution