Question

I would like to know the best open-source library for crawling and analyzing websites. One example would be a crawler for property agency sites, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads.


Solution

I do a lot of scraping, using the excellent Python packages urllib2, mechanize and BeautifulSoup.
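The fetch-then-walk-the-parse-tree workflow those packages support can be sketched without any third-party installs. The snippet below uses only the standard library's `html.parser` to show the idea; the HTML markup and class names are invented for illustration, and BeautifulSoup does the same job with far less code and much more tolerance for broken markup:

```python
from html.parser import HTMLParser

# Hypothetical markup of the kind a property-agency page might serve;
# real sites will differ, and BeautifulSoup copes better with tag soup.
PAGE = """
<html><body>
  <div class="ad"><h2>2-bed flat, riverside</h2><span class="price">$1,200</span></div>
  <div class="ad"><h2>Studio, city centre</h2><span class="price">$800</span></div>
</body></html>
"""

class AdExtractor(HTMLParser):
    """Collect (title, price) pairs from div.ad blocks."""
    def __init__(self):
        super().__init__()
        self.ads = []          # finished (title, price) tuples
        self._field = None     # which field the next text chunk belongs to
        self._current = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "ad":
            self._current = {}
        elif tag == "h2":
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "price":
            self._field = "price"

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if tag == "div" and self._current:
            self.ads.append((self._current.get("title"),
                             self._current.get("price")))
            self._current = {}

parser = AdExtractor()
parser.feed(PAGE)
print(parser.ads)
# [('2-bed flat, riverside', '$1,200'), ('Studio, city centre', '$800')]
```

With BeautifulSoup the extraction loop collapses to a couple of `find_all` calls, which is why it is the usual choice once you go beyond toy pages.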

I also suggest looking at lxml and Scrapy, though I don't use them currently (I'm still planning to try out Scrapy).
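lxml's main attraction for scraping is fast XPath querying over the parsed document. As a rough sketch of that style (the document and class names are invented; this uses the standard library's `xml.etree.ElementTree`, which only supports a limited XPath subset, whereas lxml offers full XPath 1.0 plus a lenient HTML parser for real-world pages):

```python
import xml.etree.ElementTree as ET

# Invented, well-formed XHTML snippet. lxml.html would also handle the
# messy HTML real estate sites usually serve; ElementTree will not.
DOC = """
<html><body>
  <div class="ad"><h2>2-bed flat</h2><span class="price">1200</span></div>
  <div class="ad"><h2>Studio</h2><span class="price">800</span></div>
</body></html>
"""

root = ET.fromstring(DOC)
# ElementTree's limited XPath; with lxml this would be the equivalent
# root.xpath('//div[@class="ad"]') query.
ads = root.findall('.//div[@class="ad"]')
listings = [(ad.find('h2').text, int(ad.find('span').text)) for ad in ads]
print(listings)  # [('2-bed flat', 1200), ('Studio', 800)]
```

Scrapy wraps the same kind of selector-based extraction in a full crawling framework (request scheduling, pipelines, politeness settings), which is what you would want for aggregating many sites rather than one-off scripts.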

The Perl language also has great facilities for scraping.

OTHER TIPS

PHP/cURL is a very powerful combination, especially if you want to use the results directly in a web page...

In common with Mr Morozov, I do quite a bit of scraping too, principally of job sites. I've never had to resort to mechanize, if that helps any. BeautifulSoup in combination with urllib2 has always been sufficient.

I have used lxml, which is great. However, I believe it may not have been available on Google App Engine a few months ago when I tried it, in case you need that.

My thanks are due to Mr Morozov for mentioning Scrapy. I hadn't heard of it.

Besides Scrapy, you should also take a look at Parselets.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow