Question

I would like to extract data from different web pages, such as restaurant addresses or dates of events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites?

Solution

If you're using Python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).

It's an extremely capable library and makes scraping a breeze.
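
For instance, here is a minimal sketch of what that looks like (the URL and the "addr" class name are placeholders; point the selectors at whatever markup the real pages use):

import urllib.request
from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4

html = urllib.request.urlopen("http://www.example.com/restaurants").read()
soup = BeautifulSoup(html, "html.parser")

# grab every element whose class suggests it holds an address
# ("addr" is an invented class name; inspect the real page to find the right one)
for tag in soup.find_all("div", class_="addr"):
    print(tag.get_text(strip=True))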

OTHER TIPS

For .NET programmers, the HTML Agility Pack is awesome. It turns web pages into XML-like documents that can be queried with XPath.

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // FixLink is your own helper that rewrites the href value
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");

You can find it here: http://www.codeplex.com/htmlagilitypack

I think the general answer here is to use any language + an HTTP library + an HTML/XPath parser. I find that Ruby + Hpricot gives a nice, clean solution:

require 'rubygems'
require 'hpricot'
require 'open-uri'

sites = %w(http://www.google.com http://www.stackoverflow.com)

sites.each do |site|
  doc = Hpricot(open(site))

  # iterate over each div in the document (or use xpath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end

For more on Hpricot see http://code.whytheluckystiff.net/hpricot/

I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object that is modeled after a typical web browser (i.e. you can follow links, fill out forms, or use the "back button" by calling methods on it).

For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
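
The same workflow exists in Python via the mechanize port plus any HTML parser; here's a rough sketch (untested, and the URL, link text, and form field name are all invented for illustration):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)          # many sites you'd scrape disallow robots
br.open("http://www.example.com/")

# follow a link and fill out a form, just as you would by hand in a browser
br.follow_link(text="Restaurants")
br.select_form(nr=0)                 # first form on the page
br["location"] = "Boston"            # field name is made up
response = br.submit()

# hand the resulting page to a parser to dig out the content
soup = BeautifulSoup(response.read(), "html.parser")
for tag in soup.find_all("address"):
    print(tag.get_text(strip=True))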

I think Watir or Selenium are the best choices. Most of the other libraries mentioned are actually HTML parsers, and that is not what you want... You are scraping; if the owner of the website wanted you to get at his data, he'd put a dump of his database or site on a torrent and avoid all the HTTP requests and expensive traffic.

Basically, you need to parse HTML, but more importantly you need to automate a browser, to the point of being able to move the mouse and click, really mimicking a user. To get around captchas, you need a screen-capture program to grab them and send them off to decaptcha.com (which solves them for a fraction of a cent). Forget about saving the captcha file by parsing the HTML without rendering it in a browser 'as it is supposed to be seen'. You are screen scraping, not HTTP-request scraping.
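
As a concrete illustration of driving a real browser, here is a minimal sketch using Selenium's Python bindings (the URL and the XPath expression are placeholders, and you need a matching browser and driver installed):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()   # a real browser, so JavaScript and rendering happen
driver.get("http://www.example.com/events")

# the XPath is a placeholder; point it at whatever element holds the data you want
for node in driver.find_elements(By.XPATH, "//div[@class='event-date']"):
    print(node.text)

driver.quit()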

Watir did the trick for me in combination with AutoItX (for moving the mouse and entering keys into fields, which is sometimes necessary to set off the right JavaScript events) and a simple screen-capture utility for the captchas. This way you will be most successful; it's quite useless to write a great HTML parser only to find out that the owner of the site has turned some of the text into graphics. (Problematic? No, just get an OCR library and feed it the JPEG; text will be returned.) Besides, I have rarely seen them go that far, although on Chinese sites there is a lot of text in graphics.
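
That OCR step can be as simple as the following sketch; pytesseract is just one possible OCR library (the answer doesn't name one), and it needs the Tesseract engine installed separately:

from PIL import Image
import pytesseract  # thin wrapper around the Tesseract OCR engine

# feed it the captured JPEG (file name is made up); plain text comes back
text = pytesseract.image_to_string(Image.open("captured_text.jpg"))
print(text)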

XPath saved my day all the time; it's a great domain-specific language (IMHO, I could be wrong) and you can get to any tag in the page, although sometimes you need to tweak it.
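
If you want to run those XPath expressions outside the browser, lxml is one Python option (the HTML snippet and the expression here are only illustrative):

import lxml.html

# parse an HTML string; it could equally come from a file or a browser dump
doc = lxml.html.fromstring("<html><body><p class='addr'>12 Main St</p></body></html>")

# XPath lets you reach any tag in the page
for text in doc.xpath("//p[@class='addr']/text()"):
    print(text)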

What I did miss was 'reverse templates' (Selenium's robot framework has this). Perl has this in the CPAN module Template::Extract, which is very handy.

The HTML parsing, or the creation of the DOM, I would leave to the browser; yes, it won't be as fast, but it'll work all the time.

Also, libraries that merely pretend to be user agents are of little use; sites are protected against scraping nowadays, and rendering the site on a real screen is often necessary to get beyond the captchas, and also to trigger the JavaScript events that need to fire for information to appear.

Watir if you're into Ruby, Selenium for the rest, I'd say. The 'Human Emulator' (or 'Web Emulator' in Russia) is really made for this kind of scraping, but then again it's a Russian product from a company that makes no secret of its intentions.

I also think that Wiley has a new book on scraping coming out one of these weeks; that should be interesting. Good luck...

I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.

The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.

I would use LWP (Libwww for Perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html

WWW::Scraper has docs here: http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html. It can be useful as a base, but you'd probably want to create your own module that fits your restaurant-mining needs.

LWP would give you a basic crawler to build on.

There have been a number of answers recommending Perl Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things like forms in a much cleaner way syntactically. Also, there are a few frontends that run on top of Ruby Mechanize and make things even easier.

What language do you want to use?

curl with awk might be all you need.

You can use tidy to convert it to XHTML, and then use whatever XML processing facilities your language of choice has available.
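
A hedged sketch of that pipeline in Python (it assumes the tidy command-line tool is installed; the flags are the usual ones for emitting XHTML, and the file name is made up):

import subprocess
import xml.etree.ElementTree as ET

# let tidy turn the page into well-formed XHTML
xhtml = subprocess.run(
    ["tidy", "-q", "-asxhtml", "-numeric", "page.html"],
    capture_output=True, text=True,
).stdout

# from here on it's ordinary XML processing; tidy puts everything
# in the XHTML namespace, so queries have to mention it
root = ET.fromstring(xhtml)
ns = {"x": "http://www.w3.org/1999/xhtml"}
for a in root.findall(".//x:a", ns):
    print(a.get("href"))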

I'd recommend BeautifulSoup. It isn't the fastest, but it copes really well with the malformed (X)HTML pages that most parsers choke on.

As someone said above: use any language.

As long as you have a good parser library and an HTTP library, you are set.

The tree-building approaches are slower than just using a good parsing library.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow