Browsing/parsing HTML pages in Python

https://stackoverflow.com/questions/23136157

05-07-2023

Question

I'm trying to put together a small collection of libraries that I need in order to interact with HTML pages. What I need ranges from simple browsing and interacting with the buttons or links of a web page (as in "write some text in this textbox and press this button") to parsing an HTML page and sending custom GET/POST requests to the server. I am using Python 3, and so far I have Requests for simple page loading and custom GET and POST requests, and BeautifulSoup for parsing the HTML tree. I'm thinking of trying out Mechanize for simple web page interactions.

Are there any other libraries out there that are similar to the three I am using so far? Is there some sort of gathering place where all Python libraries are listed? I sometimes find it difficult to find what I am looking for.

Solution

The right set of tools/libraries for web scraping really depends on multiple factors: the purpose, the complexity of the page(s) you want to crawl, speed, limitations, etc.

Here's a list of tools that are popular in the Python web-scraping world nowadays:

There are also HTML parsers out there. These are the most popular:

Scrapy is probably the best thing that has happened to web scraping in Python. It is a full web-scraping framework that makes crawling easy and straightforward, and it provides nearly everything you are likely to need for web crawling.

Note: if a lot of AJAX and JavaScript is involved in loading and building the page, you will need a real browser to deal with it. This is where Selenium helps: it drives a real browser and lets you interact with it through a WebDriver.
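
To make that note concrete, here is a minimal sketch rather than a definitive recipe. The `PAGE` markup, the `TagCounter` class and the `count_tags` and `rendered_html` helpers are all made up for illustration. The static half uses the standard library's `html.parser` instead of BeautifulSoup or lxml so that it runs without installing anything. The Selenium half assumes the `selenium` package and a Firefox driver are installed, and is only defined here, not called.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the markup and filled in by a script.
PAGE = """
<html><body>
  <ul id="items"></ul>
  <script>
    var li = document.createElement("li");
    li.textContent = "loaded by JS";
    document.getElementById("items").appendChild(li);
  </script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts start tags seen in the raw markup; it never runs scripts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Parse the markup as-is and return a {tag: occurrences} dict."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def rendered_html(url):
    """Return the page source after a real browser has loaded the page.

    Assumes the selenium package and a Firefox driver are installed;
    the import is deferred so the static part works without them.
    """
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


static_counts = count_tags(PAGE)
# The <ul> is in the markup, but the <li> only exists once the script has run.
print(static_counts.get("ul", 0), static_counts.get("li", 0))
```

A static parser sees the empty `<ul>` and no `<li>` at all, because nothing executes the script. Feeding the result of `rendered_html(url)` to the same parser would count elements in the DOM as the browser built it. For content that arrives some time after the initial load, you would probably also need to wait for it before reading `page_source`.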

Also see:

Hope that helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow