Question

I have a Java application which opens an existing company's website using the Socket class:

Socket sockSite;
InputStream inFile = null;
BufferedWriter out = null;

try
{
    sockSite = new Socket( presetSite, 80 );
    inFile = sockSite.getInputStream();
    out = new BufferedWriter( new OutputStreamWriter(sockSite.getOutputStream()) );
}
catch ( IOException e )
{
    ...
}

out.write( "GET " + presetPath + " HTTP/1.1\r\n\r\n" );
out.flush();

I would read the website with the stream inFile and life is good.

Recently this started to fail. I was getting an HTTP 301 "site has moved" response but no moved-to link. The site still exists and responds at the same original URL in any web browser. But the above code comes back with the HTTP 301.

I changed the code to this:

URL url;
InputStream inFile = null;

try
{
    url = new URL( presetSite + presetPath );
    inFile = url.openStream();
}
catch ( IOException e )
{
    ...
}

I then read the site from the inFile stream with the original code, and it now works again.

This difference doesn't just occur in Java but it also occurs if I use Perl (using IO::Socket::INET approach opening the website port 80, then issuing a GET fails, but using LWP::Simple method get just works). In other words, I get a failure if I open the web page first with port 80, then do a GET, but it works fine if I use a class which does it "all at once" (that just says, "get me web page with such-and-such an HTTP address").

I thought I'd try the different approaches on http://www.microsoft.com and got an interesting result. When I opened port 80 and then issued GET /, I received an HTTP 200 response with a page that said, "Your current user agent appears to be from an automated process...". But if I use the second method (URL class in Java, or LWP in Perl) I simply get their web page.

So my question is: how does the URL class (in Java) or the LWP module (in Perl) do its thing under the hood that makes it different from opening the website on port 80 and issuing a GET?


Solution

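Most servers require the Host: header to allow virtual hosting (multiple domains on one IP address). HTTP/1.1 makes this header mandatory, and the request in the question sends only the request line, so the server cannot tell which site is being asked for.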
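
As a rough illustration of that point, here is a minimal sketch of the socket approach with a Host: header added. The class name RawHttpGet, the helper buildRequest, and the host example.com are placeholders invented for this example, not names from the question. The Connection: close header is an extra assumption so that the read loop ends when the server finishes the response.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawHttpGet {

    // Builds a minimal HTTP/1.1 GET request (hypothetical helper).
    // Host tells the server which virtual host is wanted;
    // Connection: close asks it to close the socket after responding.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    public static void main(String[] args) throws IOException {
        String host = "example.com"; // placeholder host, not from the question
        String path = "/";

        try (Socket socket = new Socket(host, 80)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Print the status line, headers and body exactly as received.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

A server may still answer with a redirect or reject the request for other reasons (for example, a missing User-Agent), so this only addresses the missing Host: header.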

OTHER TIPS

If you use packet-capturing software to see what's being sent when URL is used, you'll realize that there's a lot more than just "GET /" being sent. All sorts of additional header information is included. If a server gets just a simple "GET /", it's easy to deduce that it can't be a very sophisticated client on the other end.

Also, HTTP 1.0 is "outdated"; it was superseded by HTTP 1.1.

Java's URL implementation delegates to HttpURLConnection if the URL starts with "http:".
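
To see what that delegation gives you, the following sketch opens the connection explicitly instead of calling openStream(). The class name UrlFetchSketch, the helper open, the URL http://example.com/, and the User-Agent value are all assumptions made for illustration. Automatic redirect following is switched off here so that a 301 and its Location header, if any, can be inspected rather than silently followed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class UrlFetchSketch {

    // Prepares (but does not yet send) a GET request for the given URL.
    // Hypothetical helper; nothing goes over the network until a
    // response is asked for, e.g. via getResponseCode().
    static HttpURLConnection open(String spec) throws IOException {
        URL url = new URL(spec);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Do not follow redirects automatically, so a 301 stays visible.
        conn.setInstanceFollowRedirects(false);
        // Illustrative value; without this, Java sends its own default.
        conn.setRequestProperty("User-Agent", "example-client/0.1");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = open("http://example.com/"); // placeholder URL
        try {
            // The Host: header is filled in by HttpURLConnection itself.
            System.out.println("Status:   " + conn.getResponseCode());
            System.out.println("Location: " + conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }
}
```

With redirect following left at its default, HttpURLConnection would normally request the new location on its own, which may be one reason the URL-based code in the question "just works".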

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow