Question

Background

Consider the following scenario:

  1. Link. User provides a link to some poorly formatted website (e.g., creative commons content).
  2. Scrape. Server downloads the content (web scrape), always throttled.
  3. Format. Server formats the content (e.g., performs natural language processing).
  4. Return. Server posts formatted results back to user.
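The four steps above can be sketched in Python. Everything here is a stand-in for illustration: the throttle interval, the `TextExtractor` class, and the use of plain-text extraction as the "formatting" step are all assumptions, not part of the original question.

```python
import time
import urllib.request
from html.parser import HTMLParser

THROTTLE_SECONDS = 5  # hypothetical delay between fetches (step 2: "always throttled")

class TextExtractor(HTMLParser):
    """Crude stand-in for step 3: collect the visible text of the page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

_last_fetch = 0.0

def scrape(url):
    """Step 2: download the page, never faster than one request per THROTTLE_SECONDS."""
    global _last_fetch
    wait = THROTTLE_SECONDS - (time.monotonic() - _last_fetch)
    if wait > 0:
        time.sleep(wait)
    _last_fetch = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def format_content(html):
    """Step 3: reformat the poorly formatted page (here, just extract plain text)."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

In a real service the result of `format_content` would then be returned to the user (step 4), and pre-caching would sit in front of `scrape` so repeat requests never touch the host.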

Problem

The server hosting the poorly formatted website (the host) can block the server that pulls down the content (the scraper). If this happens, the user can no longer use the service to reformat the content automatically.

Assume that the terms of service do not forbid scraping, nor is there an API available to pull the data directly.

Comments Regarding Copyright

  • The content is not subject to copyright: it is either creative commons content or already in the public domain.
  • The content would be from whitelisted domains that have been vetted (e.g., U.S. federal government works).
  • For what it's worth, I don't even know whether the sites will block the requests, especially given how infrequent the requests will be and that I will likely do some pre-caching. It's mostly academic at this point.

Question

What strategies would you employ (such as a virtual network or a cloud service) so that the IP address of the scraper can change easily, and potentially dynamically, to avoid being blocked by the host?


Licensed under: CC-BY-SA with attribution