Question

I'm using Java to fetch the content of a website so that I can analyze the text on the page. However, every time I send a GET request, the server responds with a login page rather than the page I am looking at.

I am logged into the website on all my browsers, but my application is not able to see the page as if it were me.

I also tried an API from Yandex (http://api.yandex.com/rca/) as a workaround. But when I request the page through Yandex, which fetches the content itself, I only get information based on the login page.

Can anyone give me a direction to investigate? I would like to be able to get one item on the page of a website that I work for, but it doesn't seem possible.

m_strSeedUrlPath = "http://myUrl.com/mypage.html"; // not https
// Ask the Yandex service to fetch the target page on my behalf
URLConnection connection = new URL("http://rca.yandex.com/?key={MyActualKeyNotThisText}&url=" + m_strSeedUrlPath).openConnection();
connection.setRequestProperty("Accept-Charset", "UTF-8");
InputStream response = connection.getInputStream();
// Copy the response body into a String (IOUtils is org.apache.commons.io.IOUtils)
StringWriter writer = new StringWriter();
IOUtils.copy(response, writer, "UTF-8");
String strString = writer.toString();

System.out.println(strString);

Solution

The URLConnection object will connect to the page, but in a different session from the one your browser has. You would have to log in programmatically from your Java code.

Create a URLConnection object to the login page, POST the user name and password, and read the response from the connection's InputStream. Then create a new connection to the page you wish to analyze. You also have to keep the cookies set by the login response and send them with the second request; otherwise the server will not recognize the session and will return the login page again.
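
The steps above could be sketched roughly as follows. This is only a minimal outline, not the way your particular site necessarily works: the form field names `username` and `password`, the URLs passed on the command line, and the class name `LoginThenFetch` are all assumptions for illustration. Many sites also require extra hidden form fields (for example a CSRF token) that you would have to read from the login page first. The sketch relies on `java.net.CookieManager` to store the session cookie from the login response and resend it automatically on the second request.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class LoginThenFetch {

    // Encode form fields as application/x-www-form-urlencoded.
    static String formEncode(Map<String, String> fields) {
        try {
            StringJoiner body = new StringJoiner("&");
            for (Map.Entry<String, String> e : fields.entrySet()) {
                body.add(URLEncoder.encode(e.getKey(), "UTF-8")
                        + "=" + URLEncoder.encode(e.getValue(), "UTF-8"));
            }
            return body.toString();
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    // Read a whole stream into a String (assumes the server sends UTF-8).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    static String fetchAfterLogin(String loginUrl, Map<String, String> fields,
                                  String pageUrl) throws IOException {
        // JVM-wide cookie store: cookies set by the login response are
        // sent automatically on later requests to the same site.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        // Step 1: POST the credentials to the login page.
        HttpURLConnection login = (HttpURLConnection) new URL(loginUrl).openConnection();
        login.setRequestMethod("POST");
        login.setDoOutput(true);
        login.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        try (OutputStream out = login.getOutputStream()) {
            out.write(formEncode(fields).getBytes(StandardCharsets.UTF_8));
        }
        // Step 2: consume the login response so the cookies get stored.
        try (InputStream in = login.getInputStream()) {
            readAll(in);
        }

        // Step 3: request the protected page in the same session.
        HttpURLConnection page = (HttpURLConnection) new URL(pageUrl).openConnection();
        try (InputStream in = page.getInputStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 4) {
            System.err.println("usage: LoginThenFetch <loginUrl> <user> <password> <pageUrl>");
            return;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", args[1]); // hypothetical field name
        fields.put("password", args[2]); // hypothetical field name
        System.out.println(fetchAfterLogin(args[0], fields, args[3]));
    }
}
```

To find the real field names and the URL the form posts to, look at the `<form>` element in the login page's HTML source, or watch the login request in your browser's developer tools.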

Hope this helps!

OTHER TIPS

The URL that you are trying to access is restricted to logged-in users. Even if you are logged in via your browser, you won't be able to access the page from your Java application, because it is the browser that holds the authenticated session with the target website. That session is not visible to your Java application.
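
To illustrate that the session lives in the browser's cookies: one quick way to check this, separate from logging in from code, is to copy the session cookie out of your browser's developer tools and send it yourself. The sketch below assumes the site tracks the session with a cookie; the cookie name `JSESSIONID`, its value, and the class name `BrowserCookieFetch` are made up for the example, and a copied cookie stops working once the session expires.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class BrowserCookieFetch {

    // Build the value of a Cookie request header: "name1=value1; name2=value2".
    static String cookieHeader(Map<String, String> cookies) {
        StringJoiner header = new StringJoiner("; ");
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            header.add(e.getKey() + "=" + e.getValue());
        }
        return header.toString();
    }

    // Fetch a page while presenting cookies copied from the browser.
    static String fetchWithCookies(String pageUrl, Map<String, String> cookies)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestProperty("Cookie", cookieHeader(cookies));
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Assumes the server sends UTF-8.
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: BrowserCookieFetch <pageUrl> <sessionCookieValue>");
            return;
        }
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", args[1]); // hypothetical cookie name
        System.out.println(fetchWithCookies(args[0], cookies));
    }
}
```

If the page comes back correctly with the cookie and as a login page without it, that confirms the session cookie is what the server checks.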

You would have to find out how the website handles login, log in from your code, and then fetch the page content.

Licensed under: CC-BY-SA with attribution