Question

This is about scraping; no language in particular. Some sites let you see a JSON response if you request it directly from a web browser, but the moment a program makes the same request, they decline it and tell you to use an API. How does the site tell the difference between a user's request and, say, a request from a CLI tool or Selenium?

Solution

In general, websites look for anomalous traffic patterns. Cookies that don't match up, requests made in the wrong order, that sort of thing.
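
To make that concrete, here is a minimal sketch of an "out of order" check, assuming a Python/Flask server; the route and cookie names are hypothetical. The JSON endpoint refuses clients that never loaded the HTML page first, which is exactly the kind of scraper that jumps straight to the data:

```python
# Hypothetical server-side check: the JSON endpoint expects a cookie
# that only gets set when the HTML page is loaded first.
from flask import Flask, request, abort, make_response

app = Flask(__name__)

@app.route("/page")
def page():
    # A human visitor hits the HTML page first; the server sets a
    # cookie that later requests are expected to carry.
    resp = make_response("<html>...</html>")
    resp.set_cookie("seen_page", "1")
    return resp

@app.route("/api/data")
def data():
    # A scraper that requests the JSON directly never picked up the
    # cookie, so the request arrives "in the wrong order" and is refused.
    if request.cookies.get("seen_page") != "1":
        abort(403)
    return {"items": [1, 2, 3]}
```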

If a website is keenly interested in this distinction, it can find out whether you're a human by presenting a challenge that is hard for automated programs to solve, such as a CAPTCHA.

There are other techniques they can employ to limit "bots," like rate limiting.
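
For example, a simple fixed-window rate limiter on the server side might look like this sketch; the limit and window size are arbitrary:

```python
# Fixed-window rate limiting sketch: cap each client IP at a maximum
# number of requests per window, and reject anything beyond that.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS = 30
hits = defaultdict(list)  # client IP -> timestamps of recent requests

def allow_request(client_ip: str) -> bool:
    now = time.time()
    # Keep only the timestamps that fall inside the current window.
    recent = [t for t in hits[client_ip] if now - t < WINDOW_SECONDS]
    hits[client_ip] = recent
    if len(recent) >= MAX_REQUESTS:
        return False  # over the limit: the server would respond with HTTP 429
    hits[client_ip].append(now)
    return True
```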

Beyond that, some websites look at things like the User-Agent string and session tokens, which a savvy scraper can easily defeat. Generally, what you want to do is inspect the network traffic with a tool like Wireshark or Fiddler and mimic the requests that the web browser produces. Selenium doesn't do that out of the box.
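
On the client side, that usually boils down to replaying the headers and cookie flow you observed. Here is a minimal sketch using Python's requests library; the URL and header values are placeholders for whatever you actually capture from the browser:

```python
# Replay browser-like headers and visit the pages in the same order
# the browser would, so the expected cookies are in place.
import requests

headers = {
    # Copy the real values from your captured browser traffic.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0 Safari/537.36",
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://example.com/page",
}

with requests.Session() as s:  # the Session carries cookies between requests
    s.get("https://example.com/page", headers=headers)  # pick up cookies first
    r = s.get("https://example.com/api/data", headers=headers)
    print(r.status_code, r.json() if r.ok else r.text)
```

The Session object is what carries cookies from the first request into the second, which addresses the "requests made in the wrong order" signal described earlier.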

Licensed under: CC-BY-SA with attribution