Problem

I have made a web scraper for Google Scholar in Java with JSoup. The scraper searches Scholar for a DOI and finds the citations for that paper. This data is needed for a research project.

But the scraper only works for the first few requests. After that, it encounters a CAPTCHA on the Scholar site.

However, when I open the website in my browser (Chrome), Google Scholar loads normally.

How is this possible? All requests come from the same IP address! So far I have tried the following options:

  • Choosing a random user agent for each request (from a list of 5 user agents)
  • Adding a random delay between requests of 5 to 50 seconds
  • Using a Tor proxy; however, almost all of the exit nodes have already been blocked by Google

When I analyse the requests Chrome makes to Scholar, I see that a cookie with some session IDs is sent. This is probably why Chrome's requests are not blocked. Is it possible to use this cookie for requests made with JSoup?

Thank you!

Solution

There are three things that spring to mind:

  1. You aren't saving the cookies between requests. Your first request should save the cookie and pass it back to the server on the next request (setting the Referer header wouldn't hurt either); there's a minimal sketch after this list.
  2. If Google is being tricky, it can see that your request didn't load any of the CSS/JS/images on the page. This is a sure sign that you are a bot.
  3. JavaScript is doing something on the page once you have loaded it.
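
For the first point, here is a minimal sketch of keeping one cookie jar across requests with JSoup. The URL and user-agent string are placeholders, and the exact cookies Google sets may vary:

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class ScholarSession {
        // Cookies accumulated across requests, like a browser session.
        private final Map<String, String> cookies = new HashMap<>();

        public Document fetch(String url) throws IOException {
            Connection.Response response = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 ...") // placeholder: use a full, realistic UA string
                    .referrer("https://scholar.google.com/")
                    .cookies(cookies)             // send everything saved so far
                    .method(Connection.Method.GET)
                    .execute();
            // Save any cookies the server set, ready for the next request.
            cookies.putAll(response.cookies());
            return response.parse();
        }
    }

Reusing one such object for every request means the cookies from the first response ride along on each later request, which is what a real browser does.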

I think the first is the most likely option. You should try to copy as many of the headers you see in the Chrome request into your Java code as possible; if you want to reuse Chrome's existing session cookie directly, see the sketch below.
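
As for the question about reusing Chrome's cookie: yes, you can pass it to JSoup. A hedged sketch, assuming you copy the cookie names and values from Chrome's DevTools (the GSP/NID names and all values below are placeholders; use whatever you actually see there):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class ChromeCookieReuse {
        public static void main(String[] args) throws IOException {
            Map<String, String> chromeCookies = new HashMap<>();
            // Placeholder values: copy the real ones from Chrome DevTools
            // (Application > Cookies > scholar.google.com).
            chromeCookies.put("GSP", "value-from-devtools");
            chromeCookies.put("NID", "value-from-devtools");

            Document doc = Jsoup.connect("https://scholar.google.com/scholar?q=some+DOI")
                    // Keep the User-Agent consistent with the browser the cookies came from.
                    .userAgent("the exact UA string Chrome sends")
                    // Mirror other headers Chrome sends as closely as you can.
                    .header("Accept-Language", "en-US,en;q=0.9")
                    .cookies(chromeCookies)
                    .get();

            System.out.println(doc.title());
        }
    }

Note that session cookies expire, so a copied cookie is only a short-term fix; the cookie-jar approach above is the more durable one.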
