How to write a simple spider in Python?

https://stackoverflow.com/questions/1805231

05-07-2019

Question

I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python:

1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A

2) From the initial url, pick up these urls with this XPath and regex:

hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')

[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', ...

3) Go to each of these urls and scrape the school info with this regex:

hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')

[u'JD, ', u'University of Florida Levin College of Law, <em>magna cum laude</em> , Order of the Coif, Symposium Editor, Florida Law Review, Awards for highest grades in Comparative Constitutional History, Legal Drafting, Real Property and Sales, ', u'2007']

4) Write the scraped school info into a schools.csv file

Can you help me write this spider in Python? I've been trying to write it in Scrapy but without success. See my previous question.

Thank you.

Solution

http://www.ibm.com/developerworks/linux/library/l-spider/ is an IBM article with a good description.

http://code.activestate.com/recipes/576551/ is a Python cookbook recipe with better code but less explanation.

OTHER TIPS

Also, I suggest you read "RegEx match open tags except XHTML self-contained tags" before you try to parse HTML with a regular expression. Then think about what happens the first time someone's name forces the page to be Unicode instead of Latin-1.
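The encoding concern can be made concrete with a small sketch. It assumes Python 3 and only the standard library; the helper names `decode_page` and `write_rows` are hypothetical and not from the original post. The idea is to decode the raw response bytes explicitly (declared charset first, then UTF-8, then Latin-1 as a last resort) and to write the CSV as Unicode text.

```python
import csv
import io


def decode_page(raw_bytes, declared_charset=None):
    """Decode response bytes to str without assuming the page is Latin-1.

    Tries the charset the server declared (if any), then UTF-8. Latin-1
    maps every byte to a character, so it never fails and serves only
    as a last resort.
    """
    for encoding in (declared_charset, "utf-8"):
        if not encoding:
            continue
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next guess.
            continue
    return raw_bytes.decode("latin-1")


def write_rows(stream, rows):
    """Write rows of Unicode strings as CSV to an open text stream."""
    writer = csv.writer(stream)
    for row in rows:
        writer.writerow(row)


# Example with an in-memory stream; a real run might instead use
# open("schools.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_rows(buffer, [["/zalbert", decode_page("Université".encode("utf-8"))]])
```

Guessing encodings this way is only a fallback; when the server or the page itself declares a charset, that declaration is the better source.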

EDIT: To answer your question about a library to use in Python, I would suggest Beautiful Soup, a great HTML parser that supports Unicode throughout and does a really good job with malformed HTML, which you're going to find all over the place.
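As a rough illustration of that suggestion, here is a minimal sketch of the four steps from the question using Beautiful Soup instead of Scrapy selectors. It assumes Python 3 with the `beautifulsoup4` package installed, and it assumes the pages still use the `altRow` and `mainColumnTDa` cell classes quoted in the question. The function names and the simplified JD pattern are hypothetical choices, not the asker's code.

```python
import csv
import re
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumption: a JD entry reads "JD, <school and honors>, <4-digit year>".
JD_PATTERN = re.compile(r"JD,\s+(.*?),?\s*(\d{4})", re.DOTALL)


def attorney_paths(list_html):
    """Return hrefs found in the first 'altRow' cell of each table row."""
    soup = BeautifulSoup(list_html, "html.parser")
    paths = []
    for row in soup.find_all("tr"):
        first_cell = row.find("td", class_="altRow")
        if first_cell is None:
            continue
        for link in first_cell.find_all("a", href=True):
            paths.append(link["href"])
    return paths


def jd_entry(profile_html):
    """Return (school_text, year) from the main column, or None."""
    soup = BeautifulSoup(profile_html, "html.parser")
    cell = soup.find("td", class_="mainColumnTDa")
    if cell is None:
        return None
    # get_text drops tags such as <em>, so the pattern sees plain text.
    match = JD_PATTERN.search(cell.get_text(" ", strip=True))
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


def fetch(url):
    """Download a page as bytes; Beautiful Soup detects the encoding."""
    with urlopen(url) as response:
        return response.read()


def crawl(list_url, out_path="schools.csv"):
    """Follow each attorney link and write one CSV row per JD entry."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["profile", "school", "year"])
        for path in attorney_paths(fetch(list_url)):
            entry = jd_entry(fetch(urljoin(list_url, path)))
            if entry is not None:
                writer.writerow([path, entry[0], entry[1]])
```

Calling `crawl("http://www.whitecase.com/Attorneys/List.aspx?LastName=A")` would then walk the steps from the question, provided the site still serves that markup. The two parsing functions take HTML directly, so they can be tried on saved pages before any network requests are made.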

Licensed under: CC-BY-SA with attribution
