Distributed Systems Lab 2003



Hypertext Transfer Protocol (HTTP) Tutorial


HTTP (Hypertext Transfer Protocol) is an application-level protocol that runs over TCP (Transmission Control Protocol) and serves distributed, collaborative hypermedia information systems. It is a generic, stateless protocol.

Though WWW browsers support a variety of protocols, e.g., FTP, NNTP, and SMTP, HTTP is the protocol most frequently used with Web browsers. Several HTTP versions exist. Version 1.0 (HTTP/1.0 in RFC1945) and version 1.1 (HTTP/1.1 in RFC2616) are the most common versions today. In the distributed systems lab, we focus on a subset of HTTP that is available in both version 1.0 and version 1.1.

HTTP is a simple request/reply (RR) protocol over TCP. The standard procedure for an HTTP request is as follows (for simplicity, proxies, firewalls, etc. are not considered here):

  1. Establish a TCP connection to the host given in the URL at the given port (the WWW server is supposed to listen there). If no host is given in the URL, connect to the local host. If no port number is given in the URL, connect to port 80 (the default port).
  2. Send the HTTP request, e.g., GET /index.html HTTP/1.0 (requests are explained in detail below)
  3. Receive the requested document from the WWW server (replies are explained in detail below)
  4. Close the TCP connection (no explicit closing of the HTTP connection is necessary)
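
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the lab specification: the function names resolve_target, build_request, and fetch are made up for this example, and error handling is omitted.

```python
import socket
from urllib.parse import urlsplit


def resolve_target(url):
    """Step 1 (preparation): derive (host, port, path), applying the defaults."""
    parts = urlsplit(url)
    host = parts.hostname or "localhost"   # no host given: use the local host
    port = parts.port or 80                # no port given: use the default port
    path = parts.path or "/"
    return host, port, path


def build_request(path, version="HTTP/1.0"):
    """Step 2: a request line followed by the empty line that ends the header."""
    return f"GET {path} {version}\r\n\r\n".encode("ascii")


def fetch(url, timeout=10.0):
    """Run all four steps and return the raw reply (status line, headers, body)."""
    host, port, path = resolve_target(url)
    # Step 1: establish the TCP connection.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # Step 2: send the request.
        conn.sendall(build_request(path))
        # Step 3: read until the server closes its side of the connection.
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Step 4: leaving the with-block has closed the socket.
    return b"".join(chunks)
```

Calling fetch("http://www.dslab.tuwien.ac.at/") would perform a real network request, so the two helper functions are the easier parts to try out in isolation.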

This interaction scheme can be easily tested with standard tools:

  1. Telnet to a WWW server, e.g., telnet www.dslab.tuwien.ac.at 80
  2. Type in the request, followed by two <CRLF> (hit the return key twice), e.g., GET / HTTP/1.0<CRLF><CRLF>
            
telnet www.dslab.tuwien.ac.at 80
Trying 128.131.172.54...
Connected to www.dslab.tuwien.ac.at.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Mon, 16 Dec 2002 07:31:16 GMT
Server: Apache/1.3.12 (Unix) tomcat/1.0
Last-Modified: Sun, 15 Dec 2002 14:57:00 GMT
ETag: "fc00-2b03-3c34713c"
Accept-Ranges: bytes
Content-Length: 11011
Connection: close
Content-Type: text/html

<html>
  <head>
    <title>
      Distributed System Lab
    </title>
    <link href=".//includes/text.css" rel="stylesheet" type="text/css" />
  </head>
  <body bgcolor="#12658E">
    <table border="0" cellpadding="0" cellspacing="0" width="100%">
      <tr>
        <td align="center">
          <img alt="Our Logo" src="./images/logo.gif" />
        </td>
        <td>
           
        </td>
        <td align="left" colspan="2">
          <br />
          <font color="white">
            <h1>
              Distributed Systems Lab 2003
            </h1>
          </font>
          <hr />
        </td>
      </tr>
      ...
      ...
  </body>
</html>
          

A browser such as Netscape uses the protocol in the same way. After each request, the browser parses the reply and checks whether it needs additional requests for embedded elements such as pictures, Java code, etc. Each such element is then retrieved via a new HTTP request.
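
How a client might discover the embedded elements of a reply can be sketched with Python's standard html.parser module. The class name EmbeddedElementFinder, the function embedded_urls, and the small set of tags considered are assumptions made for this example; a real browser looks at many more elements.

```python
from html.parser import HTMLParser


class EmbeddedElementFinder(HTMLParser):
    """Collect the URLs of elements that need a follow-up request."""

    # tag -> attribute holding the URL of the element (illustrative subset)
    SOURCES = {"img": "src", "applet": "code", "script": "src"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        wanted = self.SOURCES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def embedded_urls(html_text):
    """Return the URLs of embedded elements in document order."""
    finder = EmbeddedElementFinder()
    finder.feed(html_text)
    return finder.urls
```

Each URL returned by embedded_urls would then be fetched with a request of its own.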

HTTP Drawbacks and Problems

The example above exposes a significant drawback of HTTP/1.0: the one-to-one mapping of requests to elements is inefficient when a reply contains many embedded elements. Each element requires a separate request, and each request involves the following steps:

  • the client establishes a connection to the remote host
  • the client sends the request to the remote host
  • the client retrieves the document, image, etc. from the server
  • the server closes the connection

If, for example, an average HTML page with 10 images and a Java applet is requested, this results in at least 12 requests, i.e., 12 TCP connections, even though a single connection could be reused if the embedded elements reside on the same server. Additionally, if the Java applet is not stored in an archive file (.jar), each Java class of the applet requires a separate request.
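
The arithmetic behind this example is simple enough to write down. In the following sketch the function names are chosen for illustration, and the persistent case assumes that all elements reside on the same server:

```python
def requests_needed(embedded_elements):
    """One request for the page itself plus one per embedded element."""
    return 1 + embedded_elements


def connections_needed(embedded_elements, persistent=False):
    """With one connection per request, connections equal requests;
    a reused connection to a single server carries all of them."""
    return 1 if persistent else requests_needed(embedded_elements)
```

For the page above (10 images and 1 applet, i.e., 11 embedded elements), requests_needed(11) yields 12, while connections_needed(11, persistent=True) yields 1.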

A second drawback of HTTP stems from its request/reply nature: when session-oriented services such as databases are made accessible through a Web gateway, they conflict with HTTP's request/reply scheme, which has no concept of a session.

The problem in mapping session-oriented concepts to HTTP is that HTTP does not maintain state or context information between requests. Each HTTP request is completely stand-alone and independent of all other requests. Nevertheless, such interaction patterns need to be supported. This has to be done outside the HTTP protocol by the application programmer, which may be rather complicated.
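
One common way to keep such state outside the protocol is a server-side table keyed by a session identifier that the client sends back with every request (for instance as part of the URL). The following is a minimal sketch of that idea only; the class SessionStore and its methods are hypothetical and not prescribed by the lab.

```python
import secrets


class SessionStore:
    """Keeps per-client state on the server, keyed by an opaque session ID."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        """Start a new session and return its hard-to-guess identifier."""
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = {}
        return session_id

    def get(self, session_id):
        """Return the state for a known ID, or None if the ID is unknown."""
        return self._sessions.get(session_id)
```

The application has to hand the identifier to the client in a reply and look it up again on each later request, since HTTP itself does not link the requests together.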

HTTP/1.1 tries to overcome some of these drawbacks by introducing some new and improved features. The most important improvements are:


| Feature | Description |
| --- | --- |
| Hostname identification | The browser sends Host: www.which.one in the protocol header with each request. This lets the server tell which site is meant if one machine hosts several virtual web servers, e.g., the web servers www.myfirstvirtual.at and www.mysecondvirtual.at could both run on the same machine. |
| Content negotiation | A document can be available in different formats, e.g., as plain text, PostScript, Portable Document Format (PDF), or pictures in different resolutions. Browsers can negotiate with the WWW server which format they prefer. |
| Persistent connections | HTML pages usually do not consist of plain HTML only but contain inline pictures, applets, etc. With HTTP/1.0 this meant a separate TCP/HTTP connection for each of the parts that constitute a page. Thus an HTML page with 7 images and an applet requires at least (!) 9 connections. With persistent connections, one connection can be used for multiple transfers. |
| Byte ranges | This concept allows the client to specify which byte range of a document it wants. This can be very useful with long documents. |
| Proxy support | Proxy caches have the problem that they do not know how long to keep certain documents in the cache. HTTP/1.1 provides means to deal with these validity problems. |

Table 1. HTTP/1.1 improvements.
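
To illustrate hostname identification, the following sketch selects a document root from the Host header. The two host names are taken from Table 1; the directory paths, the default, and the function name document_root are invented for this example.

```python
# Hypothetical mapping from virtual host name to document root.
VIRTUAL_HOSTS = {
    "www.myfirstvirtual.at": "/srv/first",
    "www.mysecondvirtual.at": "/srv/second",
}


def document_root(host_header, default="/srv/default"):
    """Pick the document root for the value of a Host header.

    An optional ':port' suffix is dropped and host names are
    compared case-insensitively.
    """
    hostname = host_header.split(":", 1)[0].strip().lower()
    return VIRTUAL_HOSTS.get(hostname, default)
```

Both virtual servers can thus share one machine and one listening port, with the Host value deciding which documents are served.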

HTTP Request

The following table defines the subset of HTTP requests that your HTTP component has to understand. It is given in a BNF-style form: exp1 | exp2 means either exp1 or exp2, [ exp ] denotes an optional exp, and ( exp )* denotes any sequence of exp, including the empty one.

            
request = request-line <CRLF> 
          (general-header|request-header|entity-header)* <CRLF> 
          [entity-body]

request-line = GET <Space> absolute-url <Space> (HTTP/1.0|HTTP/1.1)

absolute-url = // the path of an absolute URL

general-header = // can be ignored; some non-empty lines of information,
                 // where each line is terminated by a single <CRLF>

request-header = // contains the 'Host' parameter in an HTTP 1.1 request.
                 // e.g., 'Host: www.dslab.tuwien.ac.at'
                 // all other entries can be ignored; 
                 // some non-empty lines of information,
                 // where each line is terminated by a single <CRLF>

entity-header = // can be ignored; some non-empty lines of information,
                // where each line is terminated by a single <CRLF>

          

Hint: The HTTP request header section has to end with two CRLFs, i.e., the CRLF that terminates the last line is followed by an empty line.
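
A parser for this request subset might look like the sketch below. It is one possible reading of the grammar, not a reference solution: the names Request and parse_request are assumptions, header names are lower-cased for easier lookup, and the entity-body is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    method: str
    path: str
    version: str
    headers: dict = field(default_factory=dict)


def parse_request(raw):
    """Parse the header section of a request given as bytes.

    Raises ValueError if the empty line is missing or a line is malformed.
    """
    head, separator, _body = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("header section is not terminated by an empty line")
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3:
        raise ValueError("request line must be: method <Space> url <Space> version")
    method, path, version = parts
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip().lower()] = value.strip()
    return Request(method, path, version, headers)
```

The parser deliberately does not judge the method or the version; deciding which status code a request deserves is a separate step.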

The HTTP server answers a GET request with a response (as described below) specifying information about the requested document and the requested document itself.

HTTP Response

When a request is received at the server, the HTTP component of lab 4 has to check its correctness, process it, and return the reply to the client. The following table defines the subset of HTTP responses the lab 4 HTTP component must support.

            
response = status-line <CRLF>
           general-header
           response-header
           entity-header
           <CRLF>
           [entity-body]

status-line = HTTP/1.1 <Space> status-code+reason-phrase

status-code+reason-phrase = 200 <Space> OK |
                            400 <Space> Bad Request |
                            404 <Space> Not Found |
                            500 <Space> Internal Server Error |
                            501 <Space> Not Implemented

general-header = Date: <Space> date <CRLF> 
                 Connection: <Space> close <CRLF>

response-header = Server: <Space> vendor-string <CRLF>

entity-header = Content-Length: <Space> integer-greater-or-equal-0 <CRLF> 
                Content-Type: <Space> text/html <CRLF> 
                Last-Modified: <Space> date <CRLF>
                [ Cache-Control: <Space> no-cache <CRLF> ] // only for dynamic pages 
                [ Expires: <Space> date <CRLF> ]           // only for dynamic pages

entity-body = // the contents of the document requested by the client

date = // date format according to RFC822 and RFC1123

vendor-string = // server identification 
                // (freely definable by the server implementor)

          

Hint: The entity-body starts after the first blank line of the response, i.e., two consecutive CRLFs must precede it.

The order of the general-header , response-header and entity-header expressions (lines) is not important. The required lines simply have to be present in the response.
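
Assembling a response that contains the required lines could be sketched as follows. The function name build_response and the vendor string ExampleLabServer/0.1 are placeholders for this example; dates are produced with formatdate from Python's email.utils module, which emits the RFC822/RFC1123 style used in the examples of this tutorial.

```python
import time
from email.utils import formatdate

REASONS = {
    200: "OK",
    400: "Bad Request",
    404: "Not Found",
    500: "Internal Server Error",
    501: "Not Implemented",
}


def build_response(status, body, last_modified, now=None,
                   server="ExampleLabServer/0.1"):
    """Return the complete response as bytes.

    body is the entity-body as bytes; last_modified and now are
    timestamps in seconds since the epoch.
    """
    now = time.time() if now is None else now
    header_lines = [
        f"HTTP/1.1 {status} {REASONS[status]}",
        f"Date: {formatdate(now, usegmt=True)}",
        "Connection: close",
        f"Server: {server}",
        f"Content-Length: {len(body)}",
        "Content-Type: text/html",
        f"Last-Modified: {formatdate(last_modified, usegmt=True)}",
    ]
    # The empty line (a second CRLF) separates the headers from the body.
    head = "\r\n".join(header_lines) + "\r\n\r\n"
    return head.encode("ascii") + body
```

Content-Length is computed from the encoded body, so the body should be converted to bytes before it is passed in.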

The status codes in the HTTP response have the following meanings:


| Code | Meaning |
| --- | --- |
| 200 | The request has succeeded and returned a document. |
| 400 | A malformed HTTP request was received, e.g., GET /index.html (without the specification of the HTTP protocol version used) |
| 404 | The URL requested was not found |
| 501 | The client used an unknown method in the request, e.g., GETX / HTTP/1.0 |
| 500 | All other server errors |

Table 2. HTTP status codes and their meaning.
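
The rules of Table 2 can be turned into a small decision function. This is a sketch of one possible order of checks; the name status_for and the callback document_exists (which reports whether a path can be served) are assumptions of this example.

```python
def status_for(request_line, document_exists):
    """Choose a status code for a request line according to Table 2."""
    try:
        parts = request_line.split(" ")
        # Malformed request line, e.g. a missing protocol version.
        if len(parts) != 3 or parts[2] not in ("HTTP/1.0", "HTTP/1.1"):
            return 400
        method, path, _version = parts
        # Only GET is part of the lab subset.
        if method != "GET":
            return 501
        if not document_exists(path):
            return 404
        return 200
    except Exception:
        # All other server errors.
        return 500
```

With a lookup that knows only /index.html, the request line GET /index.html yields 400, GETX / HTTP/1.0 yields 501, and GET /unlikelytoexist HTTP/1.1 yields 404.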

The response headers give information about the server, the connection, and the entity being returned by the server. Other headers describe the document itself. This allows the client to determine the document's size in advance and display the estimated transfer time or, more importantly, to determine whether the document is cacheable:


| Header | Meaning |
| --- | --- |
| Date | represents the date and time at which the reply was generated. Its format is defined in RFC822 and RFC1123. |
| Connection | allows the sender to specify options that are desired for that particular connection. Since the HTTP component does not implement persistent connections, this header with the value close must always be included in the reply. |
| Server | indicates the name of the vendor of the HTTP server. |
| Content-Length | indicates the length of the entity-body in bytes, sent to the recipient. Even though this field is optional according to RFC2616, for the lab it must be present in every response of the HTTP component! |
| Content-Type | indicates the media type of the content transferred to the client. Comprehensive information about content types can be found in RFC1521. |
| Last-Modified | indicates the date and time at which the sender believes the resource was last modified. The exact meaning of this header field depends on the implementation of the server and the nature of the original resource. For files, it may be just the file system last-modified time. For entities with dynamically included parts, it may be the most recent of the set of last-modified times of its component parts. For database gateways, it may be the last-update time stamp of the record. |
| Cache-Control | controls the caching behavior of all the HTTP caches between the client and the HTTP server (including the client's cache). The only value valid for the lab's HTTP component is no-cache, indicating that the document must not be cached by any cache. This is useful for confidential documents, or if the document is generated dynamically and will change after every request. If you want a document to be cacheable, you must not include this header; use the Expires header only! This header is mandatory only for dynamic pages (e.g., search results). |
| Expires | gives the date/time after which the response is considered stale. A stale cache entry may not be returned by a cache unless it is first re-validated with the Web server. The presence of an Expires field does not imply that the original resource will change or cease to exist at, before, or after that time. To mark a response as "already expired," a Web server sends an Expires date that is equal to the Date header value. This header is mandatory only for dynamic pages (e.g., search results). |

Table 3. HTTP response headers.
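
The interplay of Date, Cache-Control, and Expires described in Table 3 can be sketched as follows. The function name caching_headers and the one-hour lifetime for cacheable documents are assumptions of this example; the lab text does not prescribe a lifetime.

```python
from email.utils import formatdate


def caching_headers(now, dynamic, lifetime_seconds=3600):
    """Return the Date header plus the cache-related headers.

    now is a timestamp in seconds since the epoch. Dynamic pages are
    marked as uncacheable and already expired; other pages get an
    Expires date in the future and no Cache-Control header.
    """
    headers = {"Date": formatdate(now, usegmt=True)}
    if dynamic:
        headers["Cache-Control"] = "no-cache"
        # Expires equal to Date marks the response as already expired.
        headers["Expires"] = headers["Date"]
    else:
        headers["Expires"] = formatdate(now + lifetime_seconds, usegmt=True)
    return headers
```

For a dynamic page, the Expires value is copied from Date rather than computed a second time, so the two are guaranteed to be identical.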

If you are not sure whether your HTTP server behaves correctly, a look at the FAQ page might help.

Examples

The following examples should help you to understand HTTP in greater detail. Whenever you have any doubts about HTTP, either check RFC2616 , or send a test request to an HTTP server (e.g., www.dslab.tuwien.ac.at) and have a look at that server's response.

A typical HTTP/1.0 example has already been presented in one of the previous sections. The following shows a typical HTTP/1.1 example. Note that the previous HTTP/1.0 example contained no Host line, whereas this request includes the line Host: www.infosys.tuwien.ac.at:

            
user@host:~% telnet www.infosys.tuwien.ac.at 80
Trying 128.131.172.91...
Connected to www.infosys.tuwien.ac.at.
Escape character is '^]'.
GET /Teaching/Courses/RN.html HTTP/1.1
Host: www.infosys.tuwien.ac.at

HTTP/1.1 200 OK
Date: Mon, 16 Dec 2002 07:59:02 GMT
Server: Apache/1.3.14 (Unix) tomcat/1.0 PHP/4.0.3pl1 mod_ssl/2.7.1 OpenSSL/0.9.4
Last-Modified: Tue, 23 Oct 2001 09:54:46 GMT
ETag: "67987-2095-3bd53e66"
Accept-Ranges: bytes
Content-Length: 8341
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
...
</HTML>

          

The example above shows the Host request header, which is mandatory in HTTP/1.1 and must be present in every HTTP/1.1 request. If it is missing, the server must reply with the status code "400 Bad Request". The header is necessary for multi-homed HTTP servers (servers that have been assigned multiple names and should act differently depending on the hostname in the URL). Your HTTP component has to accept both HTTP/1.0 and HTTP/1.1 requests: if the Host line is missing in an HTTP/1.1 request, your server should respond with "400 Bad Request"; for HTTP/1.0 requests, the Host line need not be checked.
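
The Host rule amounts to a single check on the parsed request. In this sketch the function name host_check_status is hypothetical, and headers is assumed to be a mapping from header names to values:

```python
def host_check_status(version, headers):
    """Return 400 if an HTTP/1.1 request lacks a Host header, else None."""
    # Header names are case-insensitive, so compare them in lower case.
    names = {name.lower() for name in headers}
    if version == "HTTP/1.1" and "host" not in names:
        return 400
    return None
```

A return value of None means the request passes this particular check; other checks may still reject it.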

The following shows a typical HTTP error (the user requests a document that is not available on the HTTP server):

            
user@host:~% telnet www.infosys.tuwien.ac.at 80
Trying 128.131.172.91...
Connected to www.infosys.tuwien.ac.at.
Escape character is '^]'.
GET /unlikelytoexist HTTP/1.1 
Host: www.infosys.tuwien.ac.at

HTTP/1.1 404 Not Found
Date: Mon, 16 Dec 2002 08:01:14 GMT
Server: Apache/1.3.14 (Unix) tomcat/1.0 PHP/4.0.3pl1 mod_ssl/2.7.1 OpenSSL/0.9.4
Last-Modified: Wed, 05 Sep 2001 11:28:58 GMT
ETag: "2222c2-282-3b960c7a"
Accept-Ranges: bytes
Content-Length: 642
Content-Type: text/html

<html>
  ... some error document goes here ...
</html>

          

The following example shows a typical request, including all the request headers, sent by a Netscape Web browser requesting /log from an HTTP server:

            
GET /log HTTP/1.0  
Connection: Keep-Alive   
User-Agent: Mozilla/4.03 [en] (X11; I; HP-UX B.10.20 9000/777)   
Host: w0:4711   
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*  
Accept-Language: en   
Accept-Charset: iso-8859-1,*,utf-8
          


Last update on: 2003-03-13
© 2001 Distributed Systems Group