Question

I have written some simple code to get the content-type of a given URL. To make the processing faster, I changed the request method to HEAD.

// Added a random puppy face picture here 
// On entering this query in browser (or Poster<mozilla> or Postman<chrome>), the
// content type is shown as image/jpeg

URL url = new URL("http://www.bubblews.com/assets/images/news/521013543_1385596410.jpg");    

HttpURLConnection connection = (HttpURLConnection) url
        .openConnection();
connection.setRequestMethod("HEAD");
connection.connect();
String contentType = connection.getContentType();
System.out.println(contentType);
if (!contentType.contains("text/html")) {
    System.out.println("NOT TEXT/HTML");
    // Do something
}

I want to take some action when the content-type is not text/html, but when I set the request method to HEAD, the content-type is reported as text/html. If I fire the same HEAD request using Poster or Postman, I see the content-type as image/jpeg.

So what makes the content-type change in the case of this Java code? Can someone please point out any mistake that I may have made?

Note: I used this post as reference


Solution

You should probably add an Accept header and/or User-Agent header.

Many web servers deliver different content depending on the headers set by the client (e.g. a web browser, Java's HttpURLConnection, curl, ...). This is especially true for Accept, Accept-Encoding, Accept-Language, User-Agent, Cookie and Referer.

As an example, a web server might refuse to deliver an image if the Referer header does not point to one of its own pages. In your case, the web server apparently doesn't deliver images when the request looks like it comes from a robot. So if you make your request look as if it's coming from a web browser, the server might deliver the image.
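A minimal sketch of what that could look like with HttpURLConnection is below. The class name HeadContentType, the helper names prepareHead and isHtml, the timeouts and the exact Accept and User-Agent values are all assumptions for illustration. I don't know which header this particular server actually checks, so you may need to experiment.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

public class HeadContentType {

    // Opens (but does not yet send) a HEAD request with browser-like headers.
    // The header values below are illustrative assumptions, not required values.
    static HttpURLConnection prepareHead(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("HEAD");
        connection.setRequestProperty("Accept", "image/*,*/*;q=0.8");
        connection.setRequestProperty("User-Agent",
                "Mozilla/5.0 (compatible; ExampleClient/1.0)");
        connection.setConnectTimeout(5000); // milliseconds
        connection.setReadTimeout(5000);    // milliseconds
        return connection;
    }

    // True only for a non-null content type whose media type is text/html,
    // ignoring case and any parameters such as "; charset=UTF-8".
    static boolean isHtml(String contentType) {
        return contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0
                ? args[0]
                : "http://www.bubblews.com/assets/images/news/521013543_1385596410.jpg";
        HttpURLConnection connection = prepareHead(URI.create(address).toURL());
        try {
            // getContentType() sends the request if that has not happened yet
            // and returns null when no Content-Type header is available.
            String contentType = connection.getContentType();
            System.out.println(contentType);
            if (contentType == null) {
                System.out.println("NO CONTENT-TYPE");
            } else if (!isHtml(contentType)) {
                System.out.println("NOT TEXT/HTML");
                // Do something
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

The null check is there because getContentType() returns null rather than throwing when the header is missing or the request fails, so calling contains() on the result directly can end in a NullPointerException.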

When crawling websites, you should respect robots.txt (because you are acting like a robot). So strictly speaking, you should be careful about faking the User-Agent if you send a lot of requests or build a big business on this. I don't know how big websites react to such behavior, especially when someone bypasses their business...

Please don't see this as a telling-off. I just wanted to point you to this, so you don't run into trouble. Maybe it's not a problem at all, YMMV.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow