How to write a simple spider in Python?
05-07-2019
Question
I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python:
1) Initial URL: http://www.whitecase.com/Attorneys/List.aspx?LastName=A
2) From the initial URL, pick up these URLs with this XPath and regex:
hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto',
u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic',
....
3) Go to each of these URLs and scrape the school info with this XPath and regex:
hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
[u'JD, ', u'University of Florida Levin College of Law, <em>magna cum laude</em>
, Order of the Coif, Symposium Editor, Florida Law Review, Awards for highest
grades in Comparative Constitutional History, Legal Drafting, Real Property and
Sales, ', u'2007']
4) Write the scraped school info to a schools.csv file.
Can you help me write this spider in Python? I've been trying to write it in Scrapy but without success. See my previous question.
Thank you.
Solution
http://www.ibm.com/developerworks/linux/library/l-spider/ (an IBM article with a good description)
or
http://code.activestate.com/recipes/576551/ (a Python Cookbook recipe with better code but less explanation)
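For what it's worth, the four steps in the question can be sketched with nothing but the Python 3 standard library. This is only a rough outline under assumptions: the cell classes `altRow` and `mainColumnTDa` are taken from the XPath expressions in the question, the site layout may have changed since, and every name here (`AltRowLinkParser`, `CellTextParser`, `extract_school`, `fetch`, `crawl`, `write_rows`) is made up for illustration. The link matching is also looser than the original XPath: it takes the first link in every `altRow` cell.

```python
import csv
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Looser variant of the question's regex: text after "JD," up to a 4-digit year.
JD_RE = re.compile(r"JD,\s*(.*?)(\d{4})", re.S)


class AltRowLinkParser(HTMLParser):
    """Collect the href of the first link inside each <td class="altRow">."""

    def __init__(self):
        super().__init__()
        self.in_alt_row = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.in_alt_row = attrs.get("class") == "altRow"
        elif tag == "a" and self.in_alt_row and attrs.get("href"):
            self.links.append(attrs["href"])
            self.in_alt_row = False  # keep only the first link per cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_alt_row = False


class CellTextParser(HTMLParser):
    """Collect the text found inside <td class="mainColumnTDa"> cells."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a matching cell
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            if self.depth:
                self.depth += 1  # nested table cell inside a matching cell
            elif dict(attrs).get("class") == "mainColumnTDa":
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_school(profile_html):
    """Return (school_text, year) from a profile page, or None if no JD entry."""
    parser = CellTextParser()
    parser.feed(profile_html)
    text = " ".join("".join(parser.chunks).split())  # collapse whitespace
    match = JD_RE.search(text)
    if match is None:
        return None
    return match.group(1).strip(" ,"), match.group(2)


def fetch(url):
    """Download a page and decode it with the charset the server declares."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch):
    """Yield (profile_url, school_text, year) for each profile linked from start_url."""
    parser = AltRowLinkParser()
    parser.feed(fetch(start_url))
    for href in parser.links:
        url = urljoin(start_url, href)
        school = extract_school(fetch(url))
        if school is not None:
            yield (url, *school)


def write_rows(rows, out):
    """Write the scraped rows to an open text file as CSV."""
    writer = csv.writer(out)
    writer.writerow(["profile_url", "school", "year"])
    writer.writerows(rows)
```

To produce the file from step 4, you would open `schools.csv` with `open("schools.csv", "w", newline="", encoding="utf-8")` and pass it to `write_rows` together with `crawl(...)` called on the initial URL. The `fetch` parameter is there so the crawl can be tried against saved pages before hitting the real site.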
OTHER TIPS
Also, I suggest you read "RegEx match open tags except XHTML self-contained tags" before you try to parse HTML with a regular expression. Then think about what happens the first time someone's name forces the page to be Unicode instead of Latin-1.
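To make the encoding point concrete, here is a small Python 3 sketch. The name is a made-up example containing a character that Latin-1 cannot represent, and `decode_page` is a hypothetical helper, not part of any library. The thing to notice is that decoding as Latin-1 never raises an error; it just quietly garbles the text.

```python
# Hypothetical attorney name containing a character outside Latin-1.
name = "Nalivojvodić"
raw = name.encode("utf-8")      # the bytes a UTF-8 page would send

wrong = raw.decode("latin-1")   # succeeds, but the last letter is garbled
right = raw.decode("utf-8")     # round-trips to the original name


def decode_page(raw_bytes, declared_charset=None):
    """Decode with the declared charset, then try UTF-8, then Latin-1.

    Latin-1 maps every byte to a character, so it only makes sense
    as the last resort: it can never fail.
    """
    for encoding in (declared_charset, "utf-8", "latin-1"):
        if not encoding:
            continue
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next one
```

In practice the declared charset would come from the HTTP `Content-Type` header or the page's meta tag, and any regex you run should run on the decoded text rather than the raw bytes.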
EDIT: To answer your question about which library to use in Python, I would suggest Beautiful Soup, a great HTML parser that supports Unicode throughout and does a really good job with malformed HTML, which you're going to find all over the place.
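As a rough idea of what that looks like, here is a sketch using the `bs4` package (Beautiful Soup 4) on Python 3. The cell classes `altRow` and `mainColumnTDa` are assumed from the XPath in the question, and `profile_links` and `school_info` are names invented for this example. The regex is applied to the cell's extracted text rather than to the raw HTML, so tags such as `<em>` never get in the way.

```python
import re

from bs4 import BeautifulSoup

# Text after "JD," up to the first 4-digit year, matched on plain text.
JD_RE = re.compile(r"JD,\s*(.*?)(\d{4})", re.S)


def profile_links(list_html):
    """Return the href of the first link in each <td class="altRow"> cell."""
    soup = BeautifulSoup(list_html, "html.parser")
    links = []
    for cell in soup.find_all("td", class_="altRow"):
        anchor = cell.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


def school_info(profile_html):
    """Return (school_text, year) from a profile page, or None if no JD entry."""
    soup = BeautifulSoup(profile_html, "html.parser")
    for cell in soup.find_all("td", class_="mainColumnTDa"):
        text = " ".join(cell.get_text().split())  # collapse whitespace
        match = JD_RE.search(text)
        if match:
            return match.group(1).strip(" ,"), match.group(2)
    return None
```

Beautiful Soup hands back Unicode strings, so the result can go straight into the `csv` module once the output file is opened with an explicit encoding such as UTF-8.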