Question

I'm building a web crawler and have a method to check for bad links. At one point I try to get the HTTP response code to determine whether a link is valid. Despite handing it a valid URL (it opens in a browser just fine), the method still reports that the link isn't valid. Here is the code:

public static boolean isBrokenLink(URL baseURL, String theHREF) {
        boolean isBroken = false;
        if (baseURL == null) {
            try {
                baseURL = new URL("HTTP", "cs.uwec.edu/~stevende/cs145testpages/", theHREF);
                System.out.println(baseURL);
            } catch (MalformedURLException e) {
                isBroken = true;
                //e.printStackTrace();
            }
        }
        try {
            URLConnection con = baseURL.openConnection();
            HttpURLConnection httpProtocol = (HttpURLConnection) con;
            System.out.println(httpProtocol.getResponseCode());
            if (httpProtocol.getResponseCode() != 200 && httpProtocol.getResponseCode() == -1) {
                isBroken = true;
            }
        } catch (IOException e) {
            isBroken = true;
            e.printStackTrace();
        }

        return isBroken;
}

Here is what I'm passing it: baseURL is null and theHREF is a relative link (page2.htm). isBroken is the boolean that is returned. I print the URL after creating it from the string. Thanks for any help! Here is the error:

java.net.UnknownHostException: cs.uwec.edu/~stevende/cs145testpages/
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1300)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
    at edu.uwec.cs.carpenne.webcrawler.Webcrawler.isBrokenLink(Webcrawler.java:106)
    at edu.uwec.cs.carpenne.webcrawler.Webcrawler.main(Webcrawler.java:181)

Solution

The exception tells us that the host name and the path together are being used as the (unknown) host. It looks like you are constructing the URL incorrectly. Maybe you forgot the http:// prefix or used the wrong getters? You can debug it by calling baseURL.getHost(), baseURL.getPath() and baseURL.getProtocol() to check that they return cs.uwec.edu, /~steve... and http.
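As a rough illustration of that debugging step, the sketch below builds the URL with the same three-argument call as in the question and prints its parts. The class name UrlParts and the helpers hostOf and describe are made up for this example, and page2.htm stands in for theHREF.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts {
    // Hypothetical helper: builds a URL with the (protocol, host, file)
    // constructor and returns whatever the URL object considers its host.
    static String hostOf(String protocol, String host, String file) {
        try {
            return new URL(protocol, host, file).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical helper: lists the components so they can be inspected.
    static String describe(URL url) {
        return "protocol=" + url.getProtocol()
                + " host=" + url.getHost()
                + " path=" + url.getPath();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Same arguments as in the question: the path ends up in the host slot.
        URL wrong = new URL("HTTP", "cs.uwec.edu/~stevende/cs145testpages/", "page2.htm");
        System.out.println(describe(wrong));
    }
}
```

The host printed here is the whole string cs.uwec.edu/~stevende/cs145testpages/, which matches the name in the UnknownHostException.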

I just noticed that you build baseURL with new URL("HTTP", "cs.uwec.edu/~stevende/cs145testpages/", theHREF). This is wrong: the second argument must be the host name only, and the path belongs in the file argument. You need to use new URL("http", "cs.uwec.edu", 80, "/~stevende/cs145testpages/" + theHREF). If a link contains an anchor/ref (the part after #), you can typically skip it, as it is not transmitted to the server.

You can also use the single-argument constructor new URL("http://cs.uwec.edu/~stevende/cs145testpages/").
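Since theHREF is a relative link, another option is to let java.net.URL resolve it against the page's URL with the two-argument constructor new URL(URL context, String spec). The following is only a minimal sketch under some assumptions: the class name LinkCheck and the method names resolve and isBroken are invented here, the 5 second timeouts are arbitrary, and treating every status below 400 as "not broken" is one possible policy, not something stated in the question.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

class LinkCheck {
    // Resolves a (possibly relative) href against the URL of the page it was found on.
    static URL resolve(String base, String href) {
        try {
            return new URL(new URL(base), href);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns true if the link cannot be fetched or answers with a 4xx/5xx status.
    static boolean isBroken(URL url) {
        try {
            URLConnection con = url.openConnection();
            if (!(con instanceof HttpURLConnection)) {
                return true; // assumption: only http/https links are checked
            }
            HttpURLConnection http = (HttpURLConnection) con;
            http.setRequestMethod("HEAD"); // status line only, no body
            http.setConnectTimeout(5000);
            http.setReadTimeout(5000);
            int code = http.getResponseCode(); // read once and reuse
            http.disconnect();
            return code < 200 || code >= 400;
        } catch (IOException e) {
            return true; // unknown host, refused connection, timeout, ...
        }
    }

    public static void main(String[] args) {
        URL page = resolve("http://cs.uwec.edu/~stevende/cs145testpages/", "page2.htm");
        System.out.println(page);
        // isBroken(page) would then perform the actual network request.
    }
}
```

Note that the condition in the question, getResponseCode() != 200 && getResponseCode() == -1, is only true when the code is -1, so a 404 would not be reported as broken; the sketch reads the code once and compares it against a range instead.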

Licensed under: CC-BY-SA with attribution