Question

My recent assignment is to make a proxy in C using socket programming. The proxy only needs to support HTTP/1.0. After several hours of work, I have made a proxy that can be used with Chromium. Various websites can be loaded, such as Google and several .edu websites; however, many websites give me a 404 error for page not found (these links work fine when not going through my proxy). These 404 errors even occur on the root address "/" of a site, which doesn't make sense.

Could this be a problem with my HTTP request? The HTTP request sent from the browser is parsed for the HTTP request method, hostname, and port. For example, if a GET request is parsed from the browser, a TCP connection is established to the hostname and port provided, and the HTTP GET request is sent in the following format:

GET /path/name/item.html HTTP/1.0\r\n\r\n

This format works for a small number of websites, but the rest return a 404 error. Could this be the problem? If not, what else could possibly be causing it?

Any help would be greatly appreciated.

Solution

One likely explanation is that you've written an HTTP/1.0 proxy, whereas most websites on shared hosting will only work with HTTP/1.1 these days (well, not quite, but I'll get to that in a second).

This isn't the only possible problem by a long way, but you'll have to give an example of a website which is failing like this to get some more ideas.

You seem to understand the basics of HTTP: the client makes a TCP connection to the server and sends an HTTP request over it, which consists of a request line (such as GET /path/name/item.html HTTP/1.0) followed by a set of optional header lines, each terminated by CRLF (i.e. \r\n). The whole lot is ended with two consecutive CRLF sequences, at which point the server at the other end matches up the request with a resource and sends back an appropriate response. Resources are all identified by a path (e.g. /path/name/item.html), which could be a real file or a dynamic page.
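Since your proxy has to read the browser's request off a socket before it can parse it, here's a minimal sketch of spotting the end of the request line and headers in a receive buffer. The function name request_head_length is just something I've made up for illustration, and it assumes you keep appending recv() data to the buffer until it returns non-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the number of bytes up to and including
 * the blank line (CRLF CRLF) that ends the request line and headers,
 * or 0 if the terminator has not been received yet. */
size_t request_head_length(const char *buf, size_t len)
{
    assert(buf != NULL);

    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    }
    return 0;
}
```

A return value of 0 just means "keep reading", which matters because a single recv() call isn't guaranteed to deliver the whole request in one go.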

That much of HTTP has stayed pretty much unchanged since it was first invented. However, think about how the client finds the server to connect to. What you give it is a URL, like this:

http://www.example.com/path/name/item.html

From this it looks at the scheme, which is http, so it knows it's making an HTTP connection. The next part is the hostname. Under original HTTP the assumption was that each hostname resolved to its own IP address, so the client would connect to that IP address and make the request. Since every server only had one website in those days, this worked fine.
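To illustrate that breakdown, here's a rough sketch of splitting a URL like the one above into hostname, port and path. The struct url_parts and the function split_http_url are names I've invented for this example. It only handles plain http:// URLs, doesn't cope with IPv6 literals or userinfo, and isn't a full URL parser.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the pieces of an http:// URL. */
struct url_parts {
    char host[256];
    int  port;
    char path[1024];
};

/* Splits "http://host[:port]/path" into its parts.
 * Returns 0 on success, or -1 if the URL is not a plain http URL
 * or one of the pieces does not fit in the struct. */
int split_http_url(const char *url, struct url_parts *out)
{
    static const char scheme[] = "http://";
    assert(url != NULL && out != NULL);

    if (strncmp(url, scheme, sizeof scheme - 1) != 0)
        return -1;
    const char *authority = url + sizeof scheme - 1;

    /* The host[:port] part ends at the first '/' or at the end of the string. */
    size_t auth_len = strcspn(authority, "/");
    const char *path = authority + auth_len;

    /* Look for an explicit port, but only inside host[:port]. */
    const char *colon = memchr(authority, ':', auth_len);
    size_t host_len = colon ? (size_t)(colon - authority) : auth_len;
    if (host_len == 0 || host_len >= sizeof out->host)
        return -1;
    memcpy(out->host, authority, host_len);
    out->host[host_len] = '\0';

    out->port = 80; /* default port for http */
    if (colon) {
        char *end;
        long p = strtol(colon + 1, &end, 10);
        /* Require at least one digit, nothing after the digits, and a valid range. */
        if (end == colon + 1 || end != path || p < 1 || p > 65535)
            return -1;
        out->port = (int)p;
    }

    /* An empty path means the root resource. */
    if (*path == '\0')
        path = "/";
    if (strlen(path) >= sizeof out->path)
        return -1;
    strcpy(out->path, path);
    return 0;
}
```

The hostname is what you'd pass to getaddrinfo() to find the IP address, and the path is what goes in the request line.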

As the number of websites increased, however, it became difficult to give every website a different IP address, particularly as many websites were so simple that they could easily share the same physical machine. It was easy to point multiple domains at the same IP address (DNS makes this really simple), but when the server accepted the TCP connection it would just know it had a request to its IP address - it wouldn't know which website to send back. So, a new Host header was added so that the client could indicate in the request itself which hostname it was requesting. This meant that one server could host lots of websites, and the webserver could use the Host header to tell which one to serve in the response.

These days this is very common - if you don't use the Host header then a number of websites won't know which website you're asking for. What usually happens is they assume some default website from the list they've got, and the chances are this won't have the file you're asking for. Even if you're asking for /, if you don't provide the Host header then the webserver may give you a 404 anyway, if it's configured that way - this isn't unreasonable if there isn't a sensible default website to give you.
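To make that concrete, here's roughly what the server's side of the decision could look like. This is purely an illustrative sketch, not how any particular webserver implements it: the table of sites, the directory names and the functions hostnames_equal and select_docroot are all made up. It also ignores the optional :port suffix that a real Host header can carry.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical table mapping hostnames to document roots. */
struct vhost {
    const char *name;
    const char *docroot;
};

static const struct vhost vhosts[] = {
    { "www.example.com", "/srv/example" },
    { "www.example.org", "/srv/example-org" },
};

/* Hostnames are case-insensitive, so compare them that way. */
int hostnames_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Picks the document root for a request. host_header is NULL when the
 * client sent no Host header. default_docroot is NULL when the server
 * has no default site configured. A NULL result stands for "answer 404". */
const char *select_docroot(const char *host_header, const char *default_docroot)
{
    if (host_header != NULL) {
        for (size_t i = 0; i < sizeof vhosts / sizeof vhosts[0]; i++) {
            if (hostnames_equal(host_header, vhosts[i].name))
                return vhosts[i].docroot;
        }
    }
    return default_docroot;
}
```

With no Host header the lookup never runs, so you get either the default site (which probably doesn't have your file) or nothing at all, and both cases end in a 404.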

You can find the description of the Host header in the HTTP RFC (section 14.23 of RFC 2616, or section 5.4 of RFC 7230, which replaced it) if you want more technical details.

Also, it's possible that websites just plain refuse HTTP/1.0 - I would be slightly surprised if that happened on so many websites, but you never know. Still, try the Host header first.

Contrary to what some people believe, there's nothing to stop you using the Host header with HTTP/1.0, although you might still find some servers which don't like that. It's a little easier than supporting full HTTP/1.1, which requires that you understand chunked encoding and other complexities, although for simple example code you could probably get away with just adding the Host header and calling it HTTP/1.1 (I wouldn't suggest this is adequate for production code, however).

Anyway, you can try adding the Host header to make your request like this:

GET /path/name/item.html HTTP/1.0\r\n
Host: www.example.com\r\n
\r\n

I've split it across lines just for easy reading - you can see there's still the blank line at the end.
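In C, building that request could look something like the sketch below. The function name build_get_request and its parameters are my own invention, and I'm assuming you've already parsed the hostname, port and path out of the browser's request. The port is only appended to the Host header when it isn't the default of 80.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes an HTTP/1.0 GET request with a Host header
 * into buf. Returns the request length in bytes (not counting the
 * terminating NUL), or -1 if the request does not fit in buf_size bytes. */
int build_get_request(char *buf, size_t buf_size,
                      const char *host, int port, const char *path)
{
    int n;
    assert(buf != NULL && host != NULL && path != NULL);

    if (port == 80) {
        n = snprintf(buf, buf_size,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "\r\n",
                     path, host);
    } else {
        n = snprintf(buf, buf_size,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s:%d\r\n"
                     "\r\n",
                     path, host, port);
    }

    /* snprintf reports the length it wanted, so detect truncation. */
    if (n < 0 || (size_t)n >= buf_size)
        return -1;
    return n;
}
```

You'd then pass the buffer and the returned length to send() in a loop until everything has been written. If the browser already sent its own Host header, it may be simpler to forward that along with the other headers rather than rebuilding the request from scratch.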

Even if this isn't causing the problem you're seeing, the Host header is a really good idea these days as there are definitely sites that won't work without it. If you're still having problems then give me an example of a site which doesn't work for you and we can try to work out why.

If anything I've said is unclear or needs more detail, just ask.

Licensed under: CC-BY-SA with attribution