Python Selenium 2 - Grabbing HTML Source with Minimal Impact

https://stackoverflow.com/questions/21649752

08-10-2022

Question

I'm fairly new to programming, and very new to Python. I'm using Selenium to access a website and push some buttons, but while I'm at that website I also need the source code. I know how to do this using urllib and Selenium, but what I don't know is how to minimize the number of requests I'm making to the website. I don't want my program to annoy the owners of the site.

I'd imagine that, since I'm already at that website using Selenium, Selenium's .page_source would be the way to go.

As an aside, is there a rule of thumb as to how many requests are too many, in say, a 24 hour period?

Solution

A WebDriver instance has a page_source property, which contains the current page's source.

For example:

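from selenium import webdriver

# Start a Firefox session and load the page once.
browser = webdriver.Firefox()
browser.get('http://example.com')
# page_source holds the HTML of the page the browser currently has loaded.
print(browser.page_source)
# Close the browser and end the session.
browser.quit()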

Regarding how to minimize the number of requests made to the website:

Reading the driver's page_source retrieves the source of the page the browser has already loaded, so no additional HTTP request is sent to the website's server.
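To illustrate the point for the "push some buttons" case, here is a minimal sketch of a helper that clicks an element and then reads the source from the same browser session, instead of fetching the page a second time with urllib. The function name click_and_get_source and the selector passed to it are hypothetical, and the driver is assumed to be a WebDriver that has already loaded the page.

```python
def click_and_get_source(driver, css_selector):
    """Click one element, then return the HTML the browser currently holds.

    `driver` is assumed to be a Selenium WebDriver that has already
    loaded the page with driver.get(...). The function name and the
    selector are illustrative, not part of Selenium.
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR,
    # used here so the sketch does not need to import selenium itself.
    driver.find_element("css selector", css_selector).click()
    # page_source is read from the browser session; this helper never
    # calls driver.get() or urllib, so it does not re-fetch the page.
    return driver.page_source
```

With the browser from the example above, this could be called as click_and_get_source(browser, '#some-button'), where '#some-button' is a placeholder selector. Note that the click itself may cause the page to make requests of its own; the helper only avoids the extra fetch for reading the source.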

Regarding a rule of thumb for how many requests are too many in a 24-hour period:

Do you own the site, or is it someone else's public-facing site? If it's yours, stay within your hosting provider's bandwidth limits and your hardware limits. If you don't own it, follow the site's terms of service and respect its robots.txt. (This is probably best asked as a separate question.)
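As a rough sketch of what respecting robots.txt can look like in code, Python 3's standard library includes urllib.robotparser. The robots.txt content, the "my-bot" user agent, and the build_parser helper below are made up for illustration; a real script would load the site's actual file once (for example with the parser's set_url() and read() methods) and would still need to follow the terms of service.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules for illustration only; not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_parser(SAMPLE_ROBOTS)

# can_fetch() reports whether the rules permit this user agent to request the URL.
allowed = parser.can_fetch("my-bot", "http://example.com/index.html")
blocked = parser.can_fetch("my-bot", "http://example.com/private/data")

# crawl_delay() returns the Crawl-delay value in seconds, or None if unset.
delay = parser.crawl_delay("my-bot")

print(allowed, blocked, delay)
```

If a site publishes a Crawl-delay, waiting at least that long between page loads is one concrete way to stay within what the owner has asked for.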

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow