Question

I am trying to write code that finds the final redirected URL for a given website, but I am having trouble handling HTTP 302 responses. Either the request is not being made properly, or something else is wrong that I do not understand. I tested it by forcing redirection with twitter.com and facebook.com and it works fine (those return 301, though). However, with the URL assigned to "urlin" it goes into an endless loop of HTTP 302 responses.

This is my first post, so I cannot include the printout because it contains more than two links.

Here's the code:

/**
 * @param args
 */
public static void main(String[] args) {
    String urlin = "http://feeds.nashuatelegraph.com/~r/news/breaking/~3/jxDTXgSDSGc/jpmorgan-ex-workers-charged-in-london-whale-loss.html";
    String url = new String();
    try {
        System.out.println("URL to redirect: " + urlin);
        int iteration = 0;
        // Prepare the connection
        HttpURLConnection con = (HttpURLConnection) new URL(urlin).openConnection();
        // con.setRequestProperty("User-Agent", "Mozilla 5.0");
        con.setReadTimeout(20000);
        con.setInstanceFollowRedirects(false);

        // Boolean flag that controls the loop
        boolean redirect = true;
        // Start searching for the final URL
        while (redirect) {
            System.out.println("\nIteration number: " + ++iteration);
            con.connect();
            System.out.println("Connected URL: " + con.getURL().toString());
            int status = con.getResponseCode();
            System.out.println("status: " + status);
            // Handle the response code received
            if (status != HttpURLConnection.HTTP_OK) {
                if (status == HttpURLConnection.HTTP_MOVED_TEMP
                        || status == HttpURLConnection.HTTP_MOVED_PERM
                        || status == HttpURLConnection.HTTP_SEE_OTHER) {
                    redirect = true;
                    // Capture the new URL
                    String newUrl = con.getHeaderField("location");
                    // Get the cookie in case it is needed
                    String cookies = con.getHeaderField("Set-Cookie");
                    System.out.println("Cookies: " + cookies);
                    // Reopen the connection
                    con = (HttpURLConnection) new URL(newUrl).openConnection();
                    if (cookies != null) {
                        con.setRequestProperty("Cookie", cookies);
                    }
                }
                // Handle 400 and 404 errors
                if (status == HttpURLConnection.HTTP_NOT_FOUND || status == HttpURLConnection.HTTP_BAD_REQUEST) {
                    throw new Exception("Error 400 or 404");
                }
            } else {
                redirect = false;
                // Get the final URL
                url = con.getURL().toString();
            }
        }
    } catch (SocketTimeoutException e) {
        System.out.println("A timeout occurred with URL: " + urlin);
    } catch (UnknownHostException e) {
        System.out.println("Unknown URL address: " + urlin);
        e.printStackTrace();
    } catch (IOException e) {
        System.out.println("IOException while processing URL record: " + urlin);
        e.printStackTrace();
    } catch (Exception e) {
        System.out.println("Error while processing URL record: " + urlin);
        e.printStackTrace();
    }
    if (!url.equals("")) {
        System.out.println("Final URL: " + url);
    } else {
        System.out.println("Final URL: " + urlin);
    }
}

I will appreciate all the advice you can give.


Solution

It seems that the site you are scraping performs multiple redirects and uses cookies to track which step you are at.

In your code, you read only the Set-Cookie header of the latest response and discard the cookies that were set earlier (i.e. the cookies set by response n-2).
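One way to keep earlier cookies is to merge every Set-Cookie value into a small jar and rebuild the Cookie header from the whole jar on each hop. The sketch below is only an illustration: the class name CookieJar and its methods store and toHeader are hypothetical, and it deliberately ignores cookie attributes such as Domain, Path, Secure and Expires, which a real client would have to honor.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: accumulates cookies across several responses.
class CookieJar {
    // name -> value; a later cookie with the same name overwrites the earlier one
    private final Map<String, String> cookies = new LinkedHashMap<>();

    // Merge all Set-Cookie header values of one response into the jar.
    void store(List<String> setCookieHeaders) {
        if (setCookieHeaders == null) {
            return; // the response set no cookies
        }
        for (String header : setCookieHeaders) {
            // Keep only the leading "name=value" pair and drop the attributes.
            String pair = header.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }

    // Build the value of the Cookie request header from everything stored so far.
    String toHeader() {
        StringJoiner joiner = new StringJoiner("; ");
        cookies.forEach((name, value) -> joiner.add(name + "=" + value));
        return joiner.toString();
    }
}
```

In the loop from the question this would probably be used as jar.store(con.getHeaderFields().get("Set-Cookie")) before reopening the connection, followed by con.setRequestProperty("Cookie", jar.toHeader()) when the jar is not empty. Note that con.getHeaderField("Set-Cookie") returns a single value even when the response sets several cookies, so getHeaderFields() is the safer call here.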

Also, the site switches between http and https, which you may need to take into account in order to send the appropriate set of cookies.
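A possible way to cover both points without managing cookies by hand is to install a java.net.CookieManager as the default CookieHandler. HttpURLConnection then stores cookies from every response and sends back the ones that match the next URL, including the rule that Secure cookies go only over https. The following is a minimal sketch under that assumption; the class RedirectResolver, its methods and the MAX_HOPS limit are hypothetical names chosen for illustration, and it has not been tested against the site from the question.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: follows redirects manually while the JDK keeps the cookies.
class RedirectResolver {
    static final int MAX_HOPS = 20; // arbitrary guard against endless redirect loops

    static boolean isRedirect(int status) {
        return status == HttpURLConnection.HTTP_MOVED_PERM  // 301
            || status == HttpURLConnection.HTTP_MOVED_TEMP  // 302
            || status == HttpURLConnection.HTTP_SEE_OTHER   // 303
            || status == 307 || status == 308;
    }

    // A Location header may be relative, so resolve it against the current URL.
    // URI is strict about illegal characters and throws IllegalArgumentException on them.
    static String resolveLocation(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    static String finalUrl(String start) throws IOException {
        // Global side effect: every HttpURLConnection opened after this call
        // shares this cookie store. ACCEPT_ALL is an assumption for this sketch.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        String url = start;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            HttpURLConnection con =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            con.setInstanceFollowRedirects(false); // must be set on every new connection
            con.setConnectTimeout(20000);
            con.setReadTimeout(20000);
            int status = con.getResponseCode();
            String location = con.getHeaderField("Location");
            con.disconnect();
            if (status == HttpURLConnection.HTTP_OK) {
                return url; // no further redirect: this is the final URL
            }
            if (!isRedirect(status) || location == null) {
                throw new IOException("Unexpected status " + status + " for " + url);
            }
            url = resolveLocation(url, location);
        }
        throw new IOException("More than " + MAX_HOPS + " redirects from " + start);
    }
}
```

Calling RedirectResolver.finalUrl(urlin) would then replace the while loop in the question. Disabling automatic redirects on each connection keeps every hop visible, which also allows http to https switches to be followed explicitly.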

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow