Question

I'm using Selenium with Python bindings to scrape AJAX content from a web page with headless Firefox. It works perfectly when run on my local machine. When I run the exact same script on my VPS, errors get thrown on seemingly random (yet consistent) lines. My local and remote systems have the same exact OS/architecture, so I'm guessing the difference is VPS-related.
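
For reference, the setup looks roughly like the following. This is a minimal sketch, assuming Xvfb via pyvirtualdisplay for headlessness (these bindings predate Firefox's native headless mode); the search URL and the div.g result selector are placeholders, not my actual script.

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1280, 1024))  # virtual X server for headless Firefox
display.start()
driver = webdriver.Firefox()
try:
    driver.get("https://www.google.com/search?q=example")  # placeholder query
    for e in driver.find_elements_by_css_selector("div.g"):  # placeholder selector
        driver.execute_script("arguments[0].scrollIntoView(true);", e)
        if e.text.strip():
            print e.text
finally:
    driver.quit()
    display.stop()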

In each of these tracebacks, the offending line runs four times before an error is thrown.

I most often get this URLError when executing JavaScript to scroll an element into view.

File "google_scrape.py", line 18, in _get_data
    driver.execute_script("arguments[0].scrollIntoView(true);", e)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 396, in execute_script
    {'script': script, 'args':converted_args})['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

Occasionally I'll get this BadStatusLine when reading text from an element.

  File "google_scrape.py", line 19, in _get_data
    if e.text.strip():
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
    return self._parent.execute(command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

A couple of times I've gotten a socket error:

  File "google_scrape.py", line 19, in _get_data
    if e.text.strip():
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
    return self._parent.execute(command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer

I'm scraping Google without a proxy, so my first thought was that my IP address is recognized as a VPS and put under something like a five-use page-manipulation limit. But my initial research indicates that these errors would not arise from being blocked.

Any insight into what these errors mean collectively, or into the considerations involved in making HTTP requests from a VPS, would be much appreciated.

Update

After a little thinking about what a webdriver really is -- automated browser input -- it should have struck me as odd that remote_connection.py makes urllib2 requests at all. It turns out the Python bindings drive the browser over HTTP: every WebElement call, the text property included, goes to the driver as its own urllib2 request. That doesn't explain the above errors, but it does suggest that per-element text calls are a poor fit for scraping.
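
You can watch these per-command HTTP requests by turning on debug logging for the remote connection; LOGGER below is the module-level logger the Python bindings use (a diagnostic sketch, not part of my script):

import logging
from selenium.webdriver.remote.remote_connection import LOGGER

logging.basicConfig(level=logging.DEBUG)
LOGGER.setLevel(logging.DEBUG)
# Every subsequent driver or element call now logs one HTTP request,
# which makes it obvious how chatty per-element scraping is.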

Update 2

I realized that, for my purposes, Selenium's only function is getting the AJAX content to load. So after the page loads, I parse the source with lxml rather than getting elements through Selenium, i.e.:

import lxml.html
html = lxml.html.fromstring(driver.page_source)

However, page_source is yet another method that results in a call to urllib2, and I consistently get the BadStatusLine error the second time I use it. Minimizing urllib2 requests is definitely a step in the right direction.
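
Once the tree is built, the per-element reads become local operations on parsed HTML instead of wire requests, e.g. (the div.g XPath is a hypothetical stand-in for the real markup):

for div in html.xpath('//div[@class="g"]'):  # hypothetical result selector
    text = div.text_content().strip()
    if text:
        print text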

Update 3

Grabbing the source with a single JavaScript call, rather than the page_source property, works better yet:

html = lxml.html.fromstring(driver.execute_script("return window.document.documentElement.outerHTML"))

Conclusion

These errors can be avoided by putting a time.sleep(10) between every few requests. The best explanation I've come up with is that Google's firewall recognizes my IP address as a VPS and therefore puts it under a stricter set of blocking rules.

This was my initial thought, but I still find it hard to believe because my web searches return no indication that the above errors could be caused by a firewall.

If that is the case, however, I would think the stricter rules could be circumvented with a proxy, though that proxy might have to be a local system or Tor to avoid the same restrictions.
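
In the meantime, here is a sketch of the sleep-based workaround: wrapping individual WebDriver calls in a pause-and-retry for exactly the three exception types above (the retry count and sleep length are guesses, not tuned values):

import socket
import time
import httplib
import urllib2
import lxml.html

def patiently(call, retries=3, pause=10):
    # Retry a WebDriver command after sleeping, for the
    # connection-level errors shown in the tracebacks above.
    for attempt in range(retries):
        try:
            return call()
        except (urllib2.URLError, httplib.BadStatusLine, socket.error):
            if attempt == retries - 1:
                raise
            time.sleep(pause)

# `driver` comes from the setup above
html = lxml.html.fromstring(patiently(lambda: driver.execute_script(
    "return window.document.documentElement.outerHTML")))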

Solution

As per our conversation, you discovered that Google has anti-scraping blocking in place even for a small number of daily scrapes. The solution is to put a delay of a few seconds between each fetch.

In the general case, since you are effectively shifting unrecoverable costs onto a third party, it is good practice to reduce the extra resource load you place on the remote server. Without pauses between HTTP fetches, a fast server and connection can amount to an unintentional denial of service, especially against scrape targets that lack Google's server resources.
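
In code, the fix amounts to nothing more than a sleep in the fetch loop (the queries list and the five-second delay are illustrative):

import time
import urllib
import lxml.html

for query in queries:  # `queries` is assumed to be defined elsewhere
    driver.get("https://www.google.com/search?q=" + urllib.quote_plus(query))
    html = lxml.html.fromstring(driver.execute_script(
        "return window.document.documentElement.outerHTML"))
    # ... extract what you need from `html` ...
    time.sleep(5)  # a few seconds between fetches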
