CGI Programming Unleashed
The CGI Specification
- CGI Overview
- CGI Methods
- Interface Specification
- More Information
The Common Gateway Interface (CGI) is an accepted standard for interfacing Web servers and external applications. Web servers were originally designed to serve static HTML documents along with other associated static files. A Web browser that communicates with a server limited to serving static pages can display only documents whose contents do not change between requests or while the page is being viewed.
A Web server is generally installed on a powerful computer, and it would be frustrating not to be able to use that computing power to offer remote users more interesting, dynamic content. The CGI specification was created to solve this problem. CGI establishes a standard way of exchanging information between Web servers and external programs. It allows information from a browser (also called a client) or from the server to be passed to an external program, which performs some action and then sends its results back to the user's browser. The external program is generally known as a CGI program, CGI script, CGI application, or simply a gateway, because it makes use of the CGI specification and is specially designed to work on a Web platform. It is executed in real time, at the user's request (even if the user does not always notice), and it can generate dynamic information on the fly.
CGI is an interface specification. It does not define how a Web server works or how a program is expected to produce results, but it establishes a set of guidelines that both must follow in order to interoperate.
Let's look at an example. Imagine that you have a product database on your system that you would like users on the Web to use, but your Web server does not understand the database internals. You must link the Web server and the database by using a CGI program. This special-purpose program may be developed by you or provided by the database vendor. It is responsible for the database queries on one hand and for the communication with the Web server on the other. This communication works only because the Web server and the program follow established rules for talking to each other. These rules, which allow the two to interface, are called the Common Gateway Interface. See Figure 2.1 for a representation of this example.
Another example is access to an Internet service (e-mail mailboxes, for example) that was not originally intended to work over the World Wide Web. One could implement a program that handles mailboxes and interacts with the Web server (and consequently, the Web browser) through CGI.
In fact, a CGI program may be a simple or a complex program and can perform any task a program is able to. The difference is that the program communicates with "the real world" by using the CGI "language."
CGI applications are often used to produce HTML pages on the fly (whose contents may change at each request). They are also often used to process the information entered in HTML forms.
The CGI specification is implemented on Web servers, as well as on programs built for use over the Web. It is not part of the HyperText Transfer Protocol (HTTP), but most Web servers choose to implement this useful feature. Therefore, you are able to use CGI applications in most known Web servers, including NCSA httpd, CERN httpd, Apache httpd, and many other commercial servers.
These Web servers are usually distributed with a set of general-purpose CGI programs that reside in a directory called cgi-bin, within the Web server root directory. This is the directory commonly used for CGI program storage, but the Webmaster is able to define other locations (and a security-conscious Webmaster will probably do that). We suggest that you take a look at the examples distributed with one of the public domain Web servers.
CGI applications can be written in any language that can be executed on a computer and, in particular, on a Web platform. In fact, you can choose any of the common languages for your CGI applications. Your choice depends on what you have to do because different languages may be specialized for different purposes. Perl, for instance, is great for string and file manipulation, while C is better for bigger, more complex programs. Perl and C are probably the most used languages for CGI programming. Feel free to choose from the following languages:
- Shell scripts (UNIX)
- Visual Basic
These languages, as well as many others, provide the programmer with the means to comply with the CGI specification and use it to its fullest potential.
A method is a way of invoking a CGI program. In fact, to execute the program, you make a request to the server using a method, which defines how the program receives the data. There are three main methods, as shown in the following sections.
When you use the GET method, the CGI program receives the data in the QUERY_STRING environment variable. The program must parse the string in order to interpret the data and carry out the required actions. The GET method should be used when you want to obtain data from the server without changing any data on the server. The exception is when the data to transmit is very long: to avoid problems with size limits on environment variables, the POST method is preferred in that case.
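As an illustration of the GET case, here is a minimal Python sketch of a CGI program that parses QUERY_STRING. The function name read_get_params and the sample field names are assumptions made for this example; the parsing is delegated to the standard library's urllib.parse.parse_qs rather than done by hand.

```python
import os
from urllib.parse import parse_qs


def read_get_params(environ=os.environ):
    """Return the GET parameters as a dict mapping each name to a list of values."""
    # The server places everything after the "?" of the URL in QUERY_STRING.
    query = environ.get("QUERY_STRING", "")
    # parse_qs splits on "&" and "=", and decodes "+" and %XX escapes.
    return parse_qs(query, keep_blank_values=True)


if __name__ == "__main__":
    params = read_get_params()
    # A CGI response starts with a header block followed by a blank line.
    print("Content-type: text/plain")
    print()
    for name, values in params.items():
        print(f"{name} = {', '.join(values)}")
```

Passing the environment as a parameter keeps the parsing logic testable outside a Web server; under a real server the default, os.environ, is used.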
When you use the POST method, the Web server transmits the data to the CGI program through standard input (stdin). The server does not mark the end of the data with an EOF character, so the program must use the CONTENT_LENGTH value in order to read stdin correctly. You should use the POST method when the data you send will alter any data on the Web server or when you want to send large amounts of data to the CGI program (usually, more than 1024 bytes, the length limit of a URL).
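The following Python sketch shows one way to read POST data safely: it reads exactly CONTENT_LENGTH bytes and never waits for an end-of-file marker. The helper name read_post_body is an assumption for this example, and the body is assumed to be URL-encoded form data.

```python
import os
import sys
from urllib.parse import parse_qs


def read_post_body(environ, stream):
    """Read exactly CONTENT_LENGTH bytes from a binary input stream."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        # A malformed length is treated as "no data".
        length = 0
    # Never read until EOF: the server does not mark the end of the data.
    return stream.read(length) if length > 0 else b""


if __name__ == "__main__":
    body = b""
    if os.environ.get("REQUEST_METHOD") == "POST":
        body = read_post_body(os.environ, sys.stdin.buffer)
    # Assumes the form was sent as application/x-www-form-urlencoded.
    fields = parse_qs(body.decode("ascii", errors="replace"))
    print("Content-type: text/plain")
    print()
    print(fields)
```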
The HEAD method is similar to the GET method, except that with the HEAD method, only the HTTP headers (and not the data itself) are sent by the Web server to the browser.
The following sections present the four major methods of communication between a Web server and a CGI program:
- Environment variables
- Command line
- Standard input
- Standard output
This presentation is based on the current version (1.1) of the Common Gateway Interface specification. You can, however, expect future versions to be backward compatible.
The environment variables are system-specific variables set by the Web server when it executes a CGI application.
The following sections list the environment variables that are available. Note, however, that some servers may include some extra proprietary variables.
AUTH_TYPE gives the type of authentication used if the server supports authentication and the script is protected.
CONTENT_LENGTH gives the length, in bytes, of the data sent to the CGI program using the POST method. The CONTENT_LENGTH variable is empty if the GET method is used.
CONTENT_TYPE gives the MIME type of data sent to a CGI program invoked by the POST method. When using the GET method, the CONTENT_TYPE variable is empty. Sample usage: application/x-www-form-urlencoded.
GATEWAY_INTERFACE provides the name and version of the CGI specification being used. Sample usage: CGI/1.1.
PATH_INFO gives the extra path information that follows the name of the CGI program on a URL.
PATH_TRANSLATED is the physical path that corresponds to PATH_INFO after the server has applied its virtual-to-physical mapping. It usually consists of the Web root directory followed by the extra path information.
QUERY_STRING is the information that follows the ? character in the URL that referenced the CGI program. Using the GET method, QUERY_STRING will contain the input to the CGI program. Using the POST method, QUERY_STRING will be empty, unless something follows the CGI program name and the attached ? character on the URL.
REMOTE_ADDR is the IP address of the remote computer that made the request.
REMOTE_HOST is the name of the remote computer that made the request.
REMOTE_IDENT gives the username as defined in the RFC 931.
RFC 931 is an Internet official document that describes a means to determine the identity of a user on a TCP connection. You can find the document at
REMOTE_USER gives the authenticated username of the client that made the request, if applicable.
REQUEST_METHOD is the method with which the request for the CGI application was made: GET, HEAD, or POST.
SCRIPT_NAME is the virtual path to the CGI program being executed: for example, /cgi-bin/finger.cgi.
SERVER_NAME is the domain name or the IP address of the computer running the Web server software. Example: www.esoterica.com.
SERVER_PORT gives the port number on which the Web server is waiting for requests, which is usually 80, the default HTTP port number.
SERVER_PROTOCOL gives the name and version of the protocol the Web server is using. Example: HTTP/1.0.
SERVER_SOFTWARE gives the name of the Web server that executes the CGI program. The format in which it is presented consists of the name followed by a slash and the version number. Example: NCSA/1.5b5.
Additionally, the client may send HTTP header values to the CGI program as HTTP variables. These variables have the same name as the HTTP headers, with hyphen (-) characters replaced by underscore (_) characters, and small letters converted to capital letters.
HTTP_ACCEPT is the contents of the Accept: header line sent by the client, corresponding to the MIME types the client can handle. Format: type/subtype,type/subtype,.... Example: */*, image/gif,image/jpeg.
HTTP_REFERER gives the contents of the Referer: header line, which contains the URL of the form from which the CGI request was originated. For example, the value of this variable could be http://www.your_host.com/comments.form if this form uses a CGI program to send results via mail (a form-by-mail gateway).
HTTP_USER_AGENT gives the name of the client program (the browser) that made the request. Example: Mozilla/1.2N(Windows;I;32bit).
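The naming rule for HTTP variables can be sketched in a few lines of Python. The function names header_to_cgi_variable and http_variables are hypothetical helpers introduced for this example; they are not part of the CGI specification.

```python
def header_to_cgi_variable(header_name):
    """Map an HTTP header name such as "User-Agent" to its CGI variable name."""
    # Drop surrounding spaces and a trailing colon, replace hyphens with
    # underscores, convert to capital letters, and add the HTTP_ prefix.
    name = header_name.strip().rstrip(":")
    return "HTTP_" + name.replace("-", "_").upper()


def http_variables(environ):
    """Select only the HTTP_* variables from an environment mapping."""
    return {name: value for name, value in environ.items()
            if name.startswith("HTTP_")}


if __name__ == "__main__":
    for header in ("Accept", "Referer", "User-Agent"):
        print(header, "->", header_to_cgi_variable(header))
```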
You can find an example of the variables available to a CGI program by looking at the output of a CGI test program, called test-cgi, presented in Figure 2.2.
The HTML that generated this output appears in Figure 2.3.
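If you want to produce a similar listing yourself, the following Python sketch prints the variables described in this section. It is not the test-cgi program itself, only an approximation written for illustration; the function name format_environment and the exact output layout are assumptions.

```python
import os

# The standard variables described in this section.
CGI_VARIABLES = [
    "AUTH_TYPE", "CONTENT_LENGTH", "CONTENT_TYPE", "GATEWAY_INTERFACE",
    "PATH_INFO", "PATH_TRANSLATED", "QUERY_STRING", "REMOTE_ADDR",
    "REMOTE_HOST", "REMOTE_IDENT", "REMOTE_USER", "REQUEST_METHOD",
    "SCRIPT_NAME", "SERVER_NAME", "SERVER_PORT", "SERVER_PROTOCOL",
    "SERVER_SOFTWARE",
]


def format_environment(environ):
    """Return one "NAME = value" line per CGI variable, then the HTTP_* ones."""
    lines = [f"{name} = {environ.get(name, '')}" for name in CGI_VARIABLES]
    # Header-derived variables differ from request to request, so list
    # whatever the server actually passed along.
    for name in sorted(environ):
        if name.startswith("HTTP_"):
            lines.append(f"{name} = {environ[name]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print("Content-type: text/plain")
    print()
    print(format_environment(os.environ))
```

Variables that the server did not set are shown with an empty value, which makes it easy to see what a given server provides.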
The CGI command line is used only with ISINDEX queries. An ISINDEX query is a special query obtained with the <ISINDEX> tag and the <BASE HREF=".."> tag (referencing the script). The data entered by the user is sent to the CGI program via the command line, unless it contains the equal sign (=), in which case the QUERY_STRING is used instead. More than one parameter can be passed to the CGI program command line, because the Web server replaces any plus signs (+) received from the client with spaces.
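The decoding described above can be sketched in Python as follows. The function isindex_arguments imitates what a server might do before launching the program; its name and the exact decoding steps are assumptions for this example, and real servers may differ in detail. Inside the CGI program itself, the words simply arrive in sys.argv.

```python
import sys
from urllib.parse import unquote


def isindex_arguments(query_string):
    """Imitate how a server could turn an ISINDEX query into command-line words."""
    # A query containing "=" is form data: it stays in QUERY_STRING only.
    if not query_string or "=" in query_string:
        return []
    # Plus signs separate the words; each word is then URL-decoded.
    return [unquote(word) for word in query_string.split("+")]


if __name__ == "__main__":
    # In the CGI program, the decoded words are the command-line arguments.
    keywords = sys.argv[1:]
    print("Content-type: text/plain")
    print()
    print("Search keywords:", keywords)
```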
The standard input (stdin) is used by the Web server to pass information to the CGI program when the POST method is used. The Web server is also responsible for sending the CONTENT_TYPE and CONTENT_LENGTH values, so that the CGI program knows what it is receiving and how long it is. The CONTENT_LENGTH value is a byte count of the URL-encoded data (spaces have been replaced by plus signs, tilde characters by %7E, and so on).
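To make the byte count concrete, this Python sketch encodes a small form the way a client would and measures the result. The helper name encode_form and the sample field names are assumptions for this example. Note that the exact set of escaped characters can vary between encoders, so the sample deliberately uses only a space and an ampersand.

```python
from urllib.parse import urlencode


def encode_form(fields):
    """Return the URL-encoded form body and the CONTENT_LENGTH that goes with it."""
    # urlencode replaces spaces with "+" and reserved characters with %XX.
    body = urlencode(fields).encode("ascii")
    # CONTENT_LENGTH counts the bytes of the encoded body, not of the raw text.
    return body, len(body)


if __name__ == "__main__":
    body, length = encode_form({"name": "Ada Lovelace", "q": "a&b"})
    print(body.decode("ascii"))
    print("CONTENT_LENGTH =", length)
```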
The CGI program sends its results to standard output. The output may be sent directly to the user's browser, or it may be interpreted by the Web server so that an action can be carried out (redirection to another existing URL, for example). CGI programs may bypass the server and talk to the browser directly. To distinguish these programs from regular ones, their names must start with nph- (No Parse Header), which tells the server to pass the program's output through without interpreting it, including any HTTP or MIME headers. It is then up to the CGI program to return valid HTTP headers to the browser.
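As a rough sketch of what an nph- program has to produce, the Python function below builds a complete HTTP response, status line included. The function name nph_response, the HTTP/1.0 version string, and the choice of headers are assumptions for this example; a real program should echo the protocol reported in SERVER_PROTOCOL and send whatever headers its content requires.

```python
import sys


def nph_response(body, status="200 OK", content_type="text/html"):
    """Build a full HTTP response, since the server adds nothing for nph- programs."""
    payload = body.encode("ascii")
    head = (
        f"HTTP/1.0 {status}\r\n"
        f"Content-type: {content_type}\r\n"
        f"Content-length: {len(payload)}\r\n"
        "\r\n"  # blank line separating the headers from the body
    )
    return head.encode("ascii") + payload


if __name__ == "__main__":
    # Write bytes directly so that no newline translation alters the response.
    sys.stdout.buffer.write(nph_response("<p>Hello from an nph- program</p>"))
    sys.stdout.buffer.flush()
```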
But if an nph- program is not used, the server looks for any of three special headers that the CGI program may return:
- Content-type: This is the MIME type header. Because CGI programs usually output HTML text for a browser to display, it is common to use Content-type: text/html\n\n. Notice the two newline characters at the end of the line: a blank line must follow the last header.
- Location: Tells the server you are referencing another document. The server may either issue a Redirect to the client or send the contents of the referenced document, depending on whether it is a complete URL or a virtual (relative) path.
- Status: This is the status line the server should send to the client. Format: nnn xxxxx, where nnn is a three-digit code, and xxxxx is the corresponding description text.
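The three headers listed above can be illustrated with a small Python helper. The function name cgi_response and its parameters are assumptions for this example, and the URL shown is a placeholder.

```python
def cgi_response(body="", content_type="text/html", status=None, location=None):
    """Build the output of a regular (non-nph) CGI program as a string."""
    headers = []
    if status:
        # Format: a three-digit code followed by its description text.
        headers.append(f"Status: {status}")
    if location:
        # The server acts on the reference instead of sending a document body.
        headers.append(f"Location: {location}")
    else:
        headers.append(f"Content-type: {content_type}")
    # The blank line after the headers is mandatory.
    return "\n".join(headers) + "\n\n" + body


if __name__ == "__main__":
    # An ordinary HTML page.
    print(cgi_response("<h1>Hello</h1>"), end="")
```

A redirect would be produced with cgi_response(location="http://www.example.com/"), and an error page with cgi_response("Not here", content_type="text/plain", status="404 Not Found").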
For a quick example of a CGI program, let's take a look at a finger gateway that returns information about an e-mail address, using the finger client available in most UNIX platforms. The query is made with the ISINDEX tag. The finger CGI program presented here is included with every Apache server distribution. Be careful because it is not a secure finger gateway. A malicious user could invoke shell commands through it. This leads us to an important part of CGI design: security. See the following section for some pointers concerning this important issue.
Notice the e-mail address concatenated with the URL of the CGI finger gateway. It was sent to the finger client via the command line, as you can see in Figure 2.4. The HTML page in which you enter the e-mail address is presented in Figure 2.5. You can see an example of the finger information for firstname.lastname@example.org in Figure 2.6.
Here you will find pointers to interesting and important information about the CGI specification.
The essential CGI site is located at the National Center for Supercomputing Applications. This is a must for everyone interested in mastering CGI:
M. Hedlund maintains a good FAQ on CGI programming:
Alan Richmond maintains a good site about the CGI specification:
Lincoln Stein maintains an excellent FAQ about World Wide Web security, in which we find a chapter dedicated to CGI security:
You can also find lots of interesting pointers in the CGI section of Yahoo!, at the following location:
See Figure 2.7 for the Yahoo! list of CGI references.
The good stuff can be found at http://www.worldwidemart.com/scripts/. This is a site where you can find lots of good and useful CGI programs.
This chapter has described in detail the Common Gateway Interface specification. The CGI specification is an accepted standard for interaction between Web servers and other programs, developed to perform lots of different tasks. You can use the information in this chapter as a reference while you develop your own CGI programs in your preferred computer language.