4.6. improvement #5: Stripping out the HTTP response

One thing that this tutorial has completely glossed over so far is HTTP itself. This section attempts to answer a few questions about the protocol.

First, a bit of background:

If you listen to the hype and the buzzwords circulating through the media, you might get the idea that 'The Internet' and 'The World Wide Web' are synonyms. In fact, they really aren't.

The Internet is the large, publicly accessible network of computers that are connected together using the TCP/IP protocol suite. The Internet is nothing more or less than a very large network of computers. All it does is allow programs to talk to other programs.

The World Wide Web is one application that can be run over the Internet. The web involves hypertext documents being transported from a web server to a PC where they can be viewed in a web browser. So, you have two programs communicating with each other: a client program ("the web browser") and a server program ("the web server").

The client and server programs use the TCP protocol to connect up with each other. Once they are connected, the client will need to know what data actually needs to be sent to the server in order to get back the document that it wants to display. Likewise, the web server has to understand the client's requests, and has to know what an acceptable response will be. Clearly, a common set of "command phrases" and "responses" must be understood by both the client and the server in order for them to get meaningful work done. This set of phrases and responses is called an 'application-level protocol.' In other words, it's the protocol that web applications use to talk to each other.

The web's protocol is officially called 'HyperText Transfer Protocol' (or HTTP for short) and it is an official Internet standard, developed by the Internet Engineering Task Force. The exact semantics of HTTP are described in a document referred to as a 'Request For Comments' document, or RFC for short.

The web is, of course, only one of thousands of applications that can utilize the Internet. All of these have their own application-level protocol. Most of these are also based on an official Internet standard described in an RFC. Some examples of these other protocols include: File Transfer Protocol (FTP), Telnet Protocol, Simple Mail Transfer Protocol (SMTP), Post Office Protocol (POP), etc.

All of the RFCs that describe Internet Protocols are available to the public at this site: http://www.rfc-editor.org

(Whew!) Back to the questions, then:

Q: "How do we know that we need to send 'GET ' followed by a filename, followed by 'HTTP/1.0' to request a file?"

A: We know how HTTP works by looking at RFC 2616. (RFC 2616 has since been superseded by later RFCs, most recently RFC 9110 and RFC 9112, but the simple request used here has the same format.) There are many, many exact details that should be taken into account to write a good HTTP client, so if you're looking to write a professional quality application, you should make sure that you've read the RFC very carefully.

Q: Why do we send an extra x'0D0A' after our request string?

A: RFC 2616 defines a request as a request line (in our case a GET request) followed by zero or more header lines, which carry useful information for the web server to use when delivering the document. This tutorial calls that sequence a 'request chain', and the server's equivalent a 'response chain'. The request chain ends when a blank line is sent to the server. Since every line in an HTTP request ends with x'0D' (CR) followed by x'0A' (LF), the extra x'0D0A' looks like a blank line to the web server, and terminates our 'request chain.'
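To make the byte layout concrete, here is a minimal sketch of how such a request could be assembled. It is written in Python rather than RPG purely for illustration, and `build_request` and its parameters are hypothetical names that do not appear in the tutorial's client.

```python
CRLF = "\r\n"  # x'0D0A': carriage return followed by line feed


def build_request(path, headers=None):
    """Return the bytes of a simple HTTP/1.0 GET request.

    build_request is a hypothetical helper made up for this sketch.
    """
    lines = ["GET " + path + " HTTP/1.0"]
    for name, value in (headers or {}).items():
        lines.append(name + ": " + value)
    # Every line ends in CRLF; one more CRLF forms the blank line
    # that tells the server the request is complete.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")


request = build_request("/index.html")
print(repr(request))
```

With no extra headers, the request line's own CRLF is followed immediately by the terminating CRLF, which is the extra x'0D0A' asked about above.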

Q: Why is there extra data appearing before the actual page that I tried to retrieve?

A: The server replies with a 'response chain'. Like the 'request chain', it contains one or more lines (a status line, then any header lines), terminated with a blank line. Our simple client program did not try to interpret these lines; it merely displayed them to the user along with the page.
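As a rough illustration of that layout, the sketch below splits a made-up response at the first blank line. It is Python rather than RPG, the sample bytes are invented for the example, and `split_response` is a hypothetical name.

```python
def split_response(raw):
    """Split raw response bytes into (status_line, header_lines, body)."""
    # The first blank line (CRLF CRLF) separates the response lines
    # from the document that follows them.
    head, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line found; response lines are incomplete")
    lines = head.decode("iso-8859-1").split("\r\n")
    return lines[0], lines[1:], body


# Invented sample data, not captured from a real server.
sample = (b"HTTP/1.0 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"\r\n"
          b"<html>hello</html>\r\n")
status, headers, body = split_response(sample)
print(status)
print(headers)
print(body)
```

Everything before the blank line is what the original client displayed as "extra data"; everything after it is the page itself.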

Finally! Let's make an improvement!

Completely implementing HTTP in our sample client would be a bit too much for this tutorial. After all, this is a tutorial on socket programming, not HTTP! However, a routine to strip the HTTP responses from the data that we display should be simple enough, so let's do that.

Instead of the 'DsplyLine' subroutine in our original client, we'll utilize our nifty new 'rdline' procedure. We'll read back responses from the server until one of them is a blank line. Like this:

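         C* Read and discard the HTTP response lines, up to and
         C*   including the blank line that ends them:
         C                   dou       recbuf = *blanks
         C                   eval      rc = rdline(sock: %addr(recbuf):
         C                                         %size(recbuf): *On)
         C* On an error, show the message, close the socket and give up:
         C                   if        rc < 0
         C                   eval      msg = %str(strerror(errno))
         C                   dsply                   msg
         C                   callp     close(sock)
         C                   return
         C                   endif
         C                   enddo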

And then, we will receive the actual data, which ends when the server disconnects us. We'll display each line like this:

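         C* Everything after the blank line is the document itself.
         C*   Display each line until rdline fails, which happens
         C*   when the server disconnects:
         C                   dou       rc < 0
         C                   eval      rc = rdline(sock: %addr(recbuf):
         C                                         %size(recbuf): *On)
         C                   if        rc >= 0
         C     recbuf        dsply
         C                   endif
         C                   enddo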

The result should be the web page, without the extra HTTP responses at the start.
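For readers who want to try the two-loop idea without a connection, here is a small sketch of the same logic in Python. It is an illustration only, not a translation of the RPG above: `read_line` and `strip_head` are hypothetical names, and an in-memory `io.BytesIO` stream stands in for the socket.

```python
import io


def read_line(stream):
    """Read one line and drop its CR/LF ending; return None at end of data."""
    raw = stream.readline()
    if raw == b"":
        return None  # nothing left: the "server" has disconnected
    return raw.rstrip(b"\r\n").decode("iso-8859-1")


def strip_head(stream):
    """Skip lines up to and including the first blank one; return the rest."""
    # First loop: discard the response lines.
    while True:
        line = read_line(stream)
        if line is None or line == "":
            break
    # Second loop: everything after the blank line is the document.
    page = []
    while True:
        line = read_line(stream)
        if line is None:
            break
        page.append(line)
    return page


# Invented sample data standing in for what a server might send.
fake_socket = io.BytesIO(b"HTTP/1.0 200 OK\r\n"
                         b"Server: example\r\n"
                         b"\r\n"
                         b"<html>\r\n"
                         b"\r\n"
                         b"</html>\r\n")
page = strip_head(fake_socket)
print(page)
```

Only the first blank line ends the response lines; a blank line later in the document is kept as part of the page.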