Question

I'm writing a web scraper, and the website returns 403 Forbidden to my code even though the page loads fine in a browser. My main question: is this something the site has set up to discourage web scraping, or am I doing something wrong?

import java.net.*;
import java.io.*;

public class Main {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.pcgs.com/prices/");

        // openStream() sends Java's default User-Agent ("Java/<version>").
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(oracle.openStream()))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}

If I change the URL to a site like http://www.google.com, the program returns HTML. If the site is blocking me, is there a way around that? Thanks for the help.

Solution

The web server may be configured to block requests whose User-Agent header is not one it accepts.

I guess you can verify this by having your program send a standard User-Agent value (i.e. one corresponding to an existing web browser) and seeing whether it makes any difference.
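The User-Agent check suggested above could be sketched in Java roughly as follows. This is a minimal sketch, not a confirmed fix for this site: the class name `UserAgentFetch`, the helper `open`, and the browser-like string in `BROWSER_UA` are all illustrative assumptions. It relies only on `URLConnection.setRequestProperty`, which replaces the default `Java/<version>` User-Agent that `URL.openStream()` would otherwise send.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class UserAgentFetch {
    // Example browser-like value (an assumption); the site is not known to require this exact string.
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Prepares a connection with the given User-Agent header.
    // No request is sent until the caller opens the input stream.
    static URLConnection open(String url, String userAgent) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UserAgentFetch <url>");
            return;
        }
        URLConnection conn = open(args[0], BROWSER_UA);
        // Reading the stream triggers the actual HTTP request.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Running it as `java UserAgentFetch http://www.pcgs.com/prices/` and comparing the result with the original program would show whether the header is what the server objects to. If the 403 persists, the cause is likely something else.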

Other tips

I don't know much Java, but this simple Python code ran without an error when I tried it, and it saved the content as it appeared in my browser:

import requests

r = requests.get('http://www.pcgs.com/prices/')

# r.content is bytes, so the file must be opened in binary mode
with open('out.html', 'wb') as f:
    f.write(r.content)

By default, requests sends a non-browser User-Agent of the form python-requests/<version>.

So, if their site isn't blocking you on the basis of user-agent, maybe you've hit the site too quickly and they've blocked your IP address or rate limited you? If you intend to scrape sites, you should be nice and limit the number of requests you make.
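One way to limit the request rate, as suggested above, is to enforce a minimum gap between consecutive requests. The following is a minimal sketch under assumptions: the class name `PoliteThrottle`, the `acquire` method, and the 500 ms interval in the demo are hypothetical, and no particular interval is known to be acceptable to the site.

```java
import java.util.concurrent.TimeUnit;

class PoliteThrottle {
    private final long minIntervalNanos;
    private long nextAllowedAt;      // System.nanoTime() value before which callers must wait
    private boolean first = true;    // the first call never waits

    PoliteThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    // Blocks until at least the minimum interval has passed since the previous call returned.
    synchronized void acquire() throws InterruptedException {
        if (!first) {
            long waitNanos;
            // Loop in case sleep returns slightly early.
            while ((waitNanos = nextAllowedAt - System.nanoTime()) > 0) {
                TimeUnit.NANOSECONDS.sleep(waitNanos);
            }
        }
        first = false;
        nextAllowedAt = System.nanoTime() + minIntervalNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteThrottle throttle = new PoliteThrottle(500); // illustrative interval
        for (int i = 1; i <= 3; i++) {
            throttle.acquire();
            // A real scraper would issue its HTTP request here.
            System.out.println("request " + i + " may be sent now");
        }
    }
}
```

Calling `acquire()` immediately before each request keeps a single-threaded scraper from sending requests closer together than the chosen interval.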

Another thing you can do before scraping is check a site's robots.txt (like the one for Stack Overflow), which explicitly declares the site's policies towards automated scrapers. (In this case, the PCGS site doesn't appear to have one.)
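To illustrate what checking a robots.txt involves, here is a deliberately simplified sketch. It assumes the file's text has already been downloaded, and it only honors `Disallow` prefix rules in groups addressed to `User-agent: *`. `Allow` rules, wildcards, and crawler-specific groups are ignored, so it is not a complete implementation of the robots.txt rules. The class name `RobotsCheck` and the sample file contents are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsCheck {
    // Collects the Disallow prefixes from groups that apply to all crawlers ("User-agent: *").
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean groupApplies = false; // does the current group include "User-agent: *"?
        boolean inRules = false;      // has a rule line been seen since the last User-agent line?
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip comments
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (inRules) {            // a User-agent line after rules starts a new group
                    groupApplies = false;
                    inRules = false;
                }
                if (value.equals("*")) {
                    groupApplies = true;
                }
            } else if (field.equals("disallow")) {
                inRules = true;
                if (groupApplies && !value.isEmpty()) { // an empty Disallow blocks nothing
                    disallowed.add(value);
                }
            } else if (field.equals("allow")) {
                inRules = true; // recognized as a rule line, but not evaluated in this sketch
            }
        }
        return disallowed;
    }

    // A path is treated as allowed when no collected Disallow prefix matches it.
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up robots.txt content for illustration only.
        String sample = "User-agent: *\nDisallow: /private/\n";
        List<String> rules = disallowedForAll(sample);
        System.out.println("/prices/ allowed: " + isAllowed(rules, "/prices/"));
        System.out.println("/private/x allowed: " + isAllowed(rules, "/private/x"));
    }
}
```

A site without a robots.txt, as appears to be the case here, states no crawler policy this way, which makes the request-rate courtesy above the main safeguard.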

License: CC-BY-SA with attribution
Not affiliated with StackOverflow