CGI Programming Unleashed
Crash Course in CGI
- Why CGI Exists
- Wanna Have a Conversation?
- Parlez Vous Environment Variables?
- Taking It All In
- Some Things to Consider
Divide and conquer. If something is made up of a group of things, go after the parts of it you're comfortable tackling one by one, and before you know it, you'll have gotten through it all. It doesn't matter if it's a kid trying to finish all the vegetables he or she doesn't like or an adult trying to file a tax return; lots of things are just easier when you take them step by step.
With any new thing, like CGI, you can get overwhelmed trying to
take everything in at once. There are new concepts and things
that sometimes seem like background information you'll never use
unless you plan on winning a trivia game. But it's hard to know
what's important until you see how it all works in real-world
terms, so that you can picture how "mysterious CGI processes"
really just do some straightforward things for some very good
reasons, and in a specific order.
We're going to demystify those reasons and that order for you by showing you how you and CGI programs can communicate with one another. Starting at the very beginning, you're going to see things like the following:
- Why CGI exists
- What CGI really is
- Starting a CGI conversation
- How CGI gets information
- What you can do with that information
- How you can respond
As you'll see, there's nothing mysterious about the way CGI works. Unusual at first, perhaps, but nothing that can't be easily examined and put to work for you when you're ready to give it a try, and you'll be ready soon enough.
When kids are really little, they can't do everything by themselves. They try, and they give it their best effort, but when it comes to things that are too high up for them to reach or too dangerous for them to try, you step in to help them out to make sure things get done without any smashed fingers, broken toes, or other mishaps.
Web servers are like little kids. Heck, the World Wide Web has been around only a few years, and while these little tykes are growing fast, they still haven't learned to do everything they want to do. Each kid on the block is also very different, with his or her own set of values and his or her own personality. As they grow up, each of them will have their own likes and dislikes, as well as their own set of skills. Besides, while they're young, you can't expect them to do everything that you normally do in a day, can you?
CGI programs are the helpful older siblings of the Web server, the ones willing to do things that the servers haven't learned how to do, or just refuse to do. They'll help with math and get things out of storage for the servers to play with. Yes, there will occasionally be squabbles and disagreements, but if you convince the CGI program to bear with the server, and treat it nicely, a good deal of cooperation can take place.
This cooperation between Web servers and their CGI siblings allows things to get done that normally the Web server would have to turn away. It only knows how to do certain tasks, like answer when it's called and hand over a file that it has available. When it gets a request that it can't handle, it needs help.
When a Web server asks for help from a CGI program, it's basically striking up a conversation. Like any conversation, there are several parts. First, the server tells the CGI program that it wants to talk. Normally what causes this is that someone has sent a message to the server indicating that they want something special done, and that there's a specific file (in this case a CGI program) that can perform that special function. The user who started the request normally has no idea what CGI program they're calling; to them it might just have been a button on a page that says "Click Me," or a link, or something the server has been set up to call on the CGI program for automatically, without checking with the user. An example would be any of the popular Web search sites where you type data into a box and press the Search button. You don't necessarily keep track of what's being done, but as long as you get back the information you were looking for, it doesn't matter too much. No matter how the request came in, or what it's supposed to do, it's there, and the server needs to deal with it.
To start a CGI program, the server needs to find it first. Because it's been brought up right, the server doesn't go screaming around, looking everywhere for the CGI program, and disrupting everything else that's going on. Instead, the server knows where it can normally find the CGI program when it's available to help, and that's where it looks. For most servers, the place where the CGI programs are found is the cgi-bin directory.
After it finds the CGI program, if it's available, the server begins a conversation with it. This is the initialization phase, where initial contact starts up and things like "Hi, how are you?" greetings are exchanged before they get down to the real reason the CGI script is being called.
During initialization, the machine that the server software is on has to make additional space for the CGI program to execute, and it has to start a copy of the program. CGI programs have a one-track mind: Once they've been started, they carry on one conversation until they're done. If you want to talk to them about something else, you have to start up another copy of the CGI program so that the new conversation can take place. Every instance of the CGI program having a conversation with the server software is a process, and each process takes up a certain amount of space. Unless the conversations go on for a long time or get off track, most conversations are short enough that you can have large numbers of them going on at once, even though each is off on its own topic.
Once the conversation between the server and the CGI program has begun, the server needs to tell the CGI program what it wants. Because each CGI program normally has a special set of things it does, there's already some indication of what the server wants by which CGI program is being called. Like siblings, you might find one who's better at math and would be called to help with homework, while the other is better at drawing and would be called to help make a picture of a house.
To let the CGI program do its work, the server needs to give it all the information that's available about this particular request and let the CGI program decide how to handle it. It also needs to present the information in a coherent manner so that the CGI program doesn't get first names mixed up with last names, look in the wrong place for other information, or just give up in frustration because nothing makes any sense. Fortunately, the server and the CGI programs have an agreed-upon way of sharing information so that nothing ever gets left out unless there's a real problem.
The agreed-upon method that servers and CGI programs use to exchange information is the use of environment variables. No matter what the request, the CGI program always knows it can expect certain pieces of information to be in a specific location and no matter what they look like, it will be certain of what the information is supposed to be used for.
Environment variables are nothing more than named storage blocks that hold onto bits of information for programs to look up. For instance, most computers have a PATH environment variable, which tells them locations where they can look for files if they don't find them in the current directory. When the server gets a request to do something, its first step is to gather all the relevant information it can think of and place it into storage. What kind of information does it gather?
- Details about itself
- Details about the user
- Details about the user's request
You see, it doesn't know what the CGI program's going to need to get the job done, and if it just collects all this information every time, it'll get in the habit and never accidentally forget to find out something important.
To show you what kind of information the server stores, we're going to look at each of the environment variables. Don't worry too much about memorizing them, or even completely understanding why in the world the server would store that kind of information. We'll come back to the important ones when we move into how the server gets hold of the data in the section "Taking It All In." Right now, you should just get a feel for how much is actually stored.
Servers like you to know who they are, so they tell you about
themselves. Normally, you already know this, so it's not of too
much use. If you've got a bunch of servers all calling the same
CGI scripts, that might be a different situation; but for the
most part, you can just look at these as a way of giving the server
its due. Table 3.1 shows the environment variables your server
uses to identify itself and the workings around it.
|GATEWAY_INTERFACE||CGI version the server complies with.|
|SERVER_NAME||Server's IP address or host name. Example: www.yahoo.com|
|SERVER_PORT||Port on the server that received the HTTP request. Usually 80.|
|SERVER_PROTOCOL||Name and version of the protocol being used by the server to process requests.|
|SERVER_SOFTWARE||Name (and normally, version and platform) of the server software. Example: Purveyor/v1.2 Windows NT|
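To get a feel for what the server says about itself, a script can simply print these variables back out. The following is a minimal sketch in Perl, assuming the server exposes the variables through Perl's %ENV hash; the subroutine name server_report is made up for this illustration, and any variable your server doesn't set shows up as "(not set)".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a plain-text report of the server-specific variables from Table 3.1.
# $env is a hash reference of environment variables (normally \%ENV).
# Variables the server did not set are reported as "(not set)".
sub server_report {
    my ($env) = @_;
    my @names = qw(GATEWAY_INTERFACE SERVER_NAME SERVER_PORT
                   SERVER_PROTOCOL SERVER_SOFTWARE);
    return join '', map {
        my $value = defined $env->{$_} ? $env->{$_} : '(not set)';
        "$_ = $value\n";
    } @names;
}

# A CGI response starts with a header and a blank line, then the body.
print "Content-type: text/plain\n\n";
print server_report(\%ENV);
```

Run from a command line instead of through a server, this would most likely print "(not set)" for every entry, which is a quick way to see that the values really do come from the server.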
Your server knows your CGI program, but it doesn't normally know
the user on the other end or the program that's being used to
contact it with the request. Because the server's going to give
you information about the user and what the user wants, it figures
it might as well give you some information about what the user's
HTML browser, in some cases called the client, is. Some
of the most useful pieces of information that can be obtained
from this data are what the specific program is that the user
has (Is it Netscape? Internet Explorer? Mosaic?) and what page
led them here (Was it a search engine? The front page of your
site?). All of the client environment variables begin with the
prefix HTTP_ because the server builds them from the headers the
client sends as part of the HyperText Transfer Protocol (HTTP), the
method Web browsers and servers use to communicate with one another.
Table 3.2 shows a variety of the more common HTTP_ environment
variables that you may want to take advantage of.
|ACCEPT||Lists what kind of response schemes are accepted by this request.|
|ACCEPT_ENCODING||Lists what types of encoding schemes are supported by the client.|
|ACCEPT_LANGUAGE||Identifies the ISO code for the language that the client is looking to receive.|
|AUTHORIZATION||Identifies verified users.|
|CHARGE_TO||Sets up automatic billing (for future use).|
|FROM||Lists the client's e-mail address.|
|IF_MODIFIED_SINCE||Accompanies the GET request to return data only if the document is newer than the date specified.|
|PRAGMA||Sets up server directives or proxies for future use.|
|REFERER||Identifies the URL of the document that gave the link to the current document.|
|USER_AGENT||Identifies the client software, normally including version information.|
How, and if, you use client-specific headers is up to you. Remember, though, that there's a reason they're called client-specific: Not everyone's client is going to fill out all of this information. Depending on receiving a value from HTTP_FROM isn't going to work more than half the time because very few browsers currently support the capability. How do you find out for certain if something's supported or not? If you're really curious, because you think some of those bits of information look too good to pass up, you can find the latest information on browsers at http://www.webcompare.com. This list is updated quite frequently and provides a good, impartial viewpoint on what's supported and what's not.
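Because any of these values may be missing, it's safest to read them with a fallback. Here is a small sketch of that idea in Perl; the helper name client_header and the fallback strings are assumptions made for this example, not part of any standard library.

```perl
use strict;
use warnings;

# Return a client-supplied header value, or a fallback when the browser
# did not send it. $name is the header name without the HTTP_ prefix.
sub client_header {
    my ($env, $name, $fallback) = @_;
    my $value = $env->{"HTTP_$name"};
    return (defined $value && length $value) ? $value : $fallback;
}

print "Content-type: text/plain\n\n";
print "Browser: ",   client_header(\%ENV, 'USER_AGENT', 'unknown'), "\n";
print "Came from: ", client_header(\%ENV, 'REFERER', 'unknown'), "\n";
print "E-mail: ",    client_header(\%ENV, 'FROM', 'not supplied'), "\n";
```

The point of the fallback is that the program keeps working, and says something sensible, even when a browser leaves a header out.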
Every time the server receives a request to do something, it's
different. This keeps life interesting. It also means that there
are a lot of pieces of information that may really matter to your
CGI program that it has to keep track of. These request-specific
environment variables include everything from where the user is
calling from, to how the user has sent the request, to how much
(and what) information they've sent along as part of the request.
This is where the real goldmine of information lies for your program,
so we'll take time out to cover a few of these environment variables
in detail, which you'll find listed in Table 3.3. Several of the
most important request-specific environment variables are on the
list but won't be discussed just yet. Those important variables
deserve very special attention, which we'll give them in the next
section "Taking It All In."
|AUTH_TYPE||Authentication scheme used by the server.|
|CONTENT_FILE||File that contains data for the CGI program. Note: For WinCGI/Windows HTTPd only.|
|CONTENT_LENGTH||Number of bytes sent to Standard Input (STDIN) due to a POST request.|
|CONTENT_TYPE||The MIME type of data being sent.|
|OUTPUT_FILE||File that the CGI program should place data into. Note: For WinCGI/Windows HTTPd only.|
|PATH_INFO||Additional path information for the CGI program, passed as part of the URL after the program name.|
|PATH_TRANSLATED||The translated version of PATH_INFO, which points to the absolute directory.|
|QUERY_STRING||Data passed to the CGI program as part of the URL, consisting of anything after the question mark (?). For a URL ending in ?name1=value1, "name1=value1" is the QUERY_STRING.|
|REMOTE_ADDR||The end user's IP address. (The host name, when the server looks it up, goes in REMOTE_HOST.)|
|REMOTE_USER||User name, if the user has been authenticated.|
|REQUEST_LINE||The complete HTTP request line, sent to the server. Example: GET /scripts/mine.pl?hi HTTP/1.0|
|REQUEST_METHOD||The method used to pass data as part of the HTTP request.|
|SCRIPT_NAME||CGI script being run.|
The three most important and most frequently used environment variables are
- REQUEST_METHOD
- QUERY_STRING
- CONTENT_LENGTH
The reason these are the "big three" you need to familiarize yourself with is that this combination tells you how data got to the CGI program; after you know that, all you have to do is get it. We'll cover these three, as mentioned earlier, in much more detail in the next section.
What use are the other environment variables? Plenty. You can find out if people from your competition are accessing your programs, you can see if they're registered users, and you can set up links to your CGI programs so that extra path information gets included in the request, and you don't have to figure out what directory they were really looking for.
CONTENT_FILE and OUTPUT_FILE deserve special mention because not everyone uses them. Because Windows 3.1 (and DOS) don't have too many programming languages that let you read and write to Standard Input (STDIN) and Standard Output (STDOUT), substitutes were needed.
It's good to know that all the information needed by the CGI program is stored, and knowing the kinds of information stored is even better. But how does any of it get to the CGI program? Well, let's find out.
When the user makes a request for the server to execute a CGI program, additional data the user may want to send along gets stored by the server. The problem is that it can be stored in one of two ways, depending on how the user ended up starting the conversation with the server. The user doesn't really have any control over this, so they can't help you. What you need to do is find where the server wrote down how information was sent. That place is the REQUEST_METHOD environment variable.
There are two commonly used values for the REQUEST_METHOD environment variable: GET and POST. By looking at REQUEST_METHOD and seeing which of these methods the request used, the CGI program can then decide where its data is hiding and can go out and get it. What's different about the two methods? Find out in the following sections.
When a request uses the GET method, there's a very simple way of dealing with the data being sent by the user: It gets tacked on to the end of the URL (the location of your script). So, let's say that the URL to get to your script is http://yourplace.com/cgi-bin/mine.pl, and you just want to pass it the word "catapult". The new URL that gets sent to the server is http://yourplace.com/cgi-bin/mine.pl?catapult. The question mark (?) is what the CGI program (and the server) use to separate out where the data starts. Everything that comes after the question mark (?) is placed in the QUERY_STRING environment variable. So in this case, the environment variable QUERY_STRING would just contain the word "catapult".
If you've ever seen an entry in an HTML file that looks something like <a href=/cgi-bin/mine.pl?data>, then you've seen an example of how the GET method works. If the URL already has a question mark, then it already has a QUERY_STRING, and the server automatically assumes that it's coming through the GET method. This easy method of calling a script with fixed data is often used for things like random link programs, viewing stock values for a specific company's stock, and other such things where the result may change and the users may change, but the data that goes into the program should stay the same.
Using the POST method allows the server to accept more information, so you'll normally see it used more often with forms and things that have lots of stuff to send. The difficulty is that it's a little harder to get the data that's been sent in. What happens is that when the POST method is called, all the data is gathered up and sent to Standard Input (STDIN). While it seems that it would then be just as easy to call up STDIN as it is to call up QUERY_STRING, it's not. STDIN is a really big buffer, and you don't want to be reading everything that might be contained in there; you might run out of space! The server also isn't required to mark the end of the data, so a program that keeps reading until it hits the end of the input can be left waiting forever.
To help you out, the environment variable CONTENT_LENGTH tells you how much data was placed into STDIN. If there were 500 bytes, CONTENT_LENGTH will be the value 500. If there were 10 bytes, CONTENT_LENGTH will be the value 10. What this allows you to do is use your programming language's easiest method to read that number of bytes of data from STDIN and then do something with it.
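Put together, the two methods come down to one decision: check REQUEST_METHOD, then look in the right place. The sketch below shows one way this could be written in Perl. The subroutine name raw_request_data is invented for this example, and it treats a missing REQUEST_METHOD as GET, which is an assumption rather than a rule.

```perl
use strict;
use warnings;

# Fetch the raw, still-encoded request data.
# $env is a hash reference of environment variables (normally \%ENV) and
# $in is the filehandle to read POST data from (normally \*STDIN).
sub raw_request_data {
    my ($env, $in) = @_;
    my $method = uc($env->{REQUEST_METHOD} || 'GET');

    if ($method eq 'POST') {
        my $length = $env->{CONTENT_LENGTH} || 0;
        my $data   = '';
        # Read at most CONTENT_LENGTH bytes; never wait for end-of-file.
        read($in, $data, $length) if $length > 0;
        return $data;
    }

    # GET: everything after the question mark is already in QUERY_STRING.
    return defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
}

print "Content-type: text/plain\n\n";
print "Raw data: ", raw_request_data(\%ENV, \*STDIN), "\n";
```

Passing the environment and the input handle in as arguments, rather than reaching for %ENV and STDIN directly, is a design choice that makes the routine easy to try out without a server.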
When you get hold of the data, from either QUERY_STRING or STDIN, you may notice that it looks kind of strange. For instance, you might end up with a long string of data that looks like this:
What you've run into are the two steps that the client and server run before giving you access to the data. You should really thank them for doing it because it's designed to remove possible "problem" characters that could make the CGI program misbehave and also to organize everything into one convenient group. Let's look at what it has really done.
To try to help you get organized, the behind-the-scenes CGI mechanisms have arranged everything into pairs of information, separated by ampersands (&). If you've ever seen an HTML form (and you definitely will in Chapter 8, "Forms and How to Handle Them"), you may be familiar with the fact that each possible area where information can be entered has a name associated with it. For instance, a form might have fields for "name", "company", "Email", and "stuff". This helps the CGI program make sense of what information comes from where. You wouldn't want it to start confusing e-mail addresses and names, would you?
What happens is that every piece of data that can have a name associated with it does. This all happens automatically; you just see the results. This kind of formatting can be called "Name=Value pairs," but a more common term is "ordered pairs" because it's ordered in the Name=Value fashion, and whatever data is sent first is the first pair in the order. So if the "Email" field was first, the order would be email=bills&name=... and so on, until all the fields and their information had been accounted for.
The other process that's already happened to the data as it comes in is called URL encoding, or escaping, of special characters. The reason this has been done is to prevent any accidental interpretation of characters like percent signs (%), backslashes (\), and other pieces that would toss the server or the CGI program for a loop.
So how do you know when things have been encoded and what's a special character? Well, anything that has an ASCII value greater than 127 or lower than 33 is going to get encoded. But what the heck does that mean? Knowing that most of us haven't memorized the ASCII character table (because it's not something you really need to know during parties or casual conversation), all that you really need to know is this: Anything that's in the format %## (such as %25) is a special character that's been encoded. How do you know someone didn't accidentally put a percent sign (%) in a string and cause confusion? Because when percent signs are used as part of the information being sent by the user, the percent signs get encoded, too. In fact, %25 is actually the percent sign when it's encoded.
You might wonder, though, what kind of special characters show up in the data that aren't encoded because they mean something special. For instance, let's look at the sample data shown previously:
You can see the %25 there at the end, which means there was originally a percent sign there that got encoded. But what about the plus signs (+), the equal sign (=), and the ampersand (&)? They all have special reserved functionality, as you see in Table 3.4. Each one signifies either a break in the data or a special piece of encoding.
|Ampersand (&)||Joins ordered pairs together.|
|Equal (=)||Separates pair names from values.|
|Percent (%)||Marks the beginning of an encoded character.|
|Plus (+)||Substitutes for space.|
The plus sign (+) is kind of strange because %20 also means that there should be a space, but because spaces are so common, it looks nicer to just have a little + sign instead of %20 over and over again.
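To see the encoding from the sending side, here is a rough Perl sketch of what a browser does to a single value before it goes out. The subroutine name url_encode and the exact set of characters left untouched are assumptions for illustration; real browsers differ slightly in which punctuation they leave alone, and this sketch assumes plain single-byte text.

```perl
use strict;
use warnings;

# Encode a value roughly the way a browser encodes form data: letters,
# digits and a few safe punctuation marks pass through, a space becomes
# "+", and every other character becomes "%" plus two hexadecimal digits.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.\- ])/sprintf('%%%02X', ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

print url_encode('50% off'), "\n";    # prints 50%25+off
```

Notice that the percent sign the user typed comes out as %25, and the space comes out as a plus sign, just as described.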
By now you might be thinking "Okay, so it's encoded; but as what? And how do I decode it to make sense of it all?" What it's encoded as is easy: hexadecimal. How you decode it, well, that's a little bit more involved. What you want to do is break up the data into individual ordered pairs, which means that every time you see an ampersand (&), you want to make a new pair. Then you want to split the ordered pairs into the Name and the Value by breaking it apart at the equal sign (=). Next, you want to substitute spaces for any of the plus signs (+) you see. Now you're ready to use whatever method your programming language makes available to convert all the special %## characters into their real values.
There are a large number of data-processing libraries that will do all that for you, so you never even have to worry about it. All you do is insert a line that calls the other function and let it do all the work for you.
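If you're curious what such a library does under the hood, the decoding steps described above can be sketched in a few lines of Perl. The subroutine names url_decode and parse_pairs are made up for this example, and the sample string is hypothetical data rather than anything from a real form.

```perl
use strict;
use warnings;

# Decode one name or value: plus signs become spaces first, then each
# %## sequence becomes the character with that hexadecimal code.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Split raw form data into a hash of decoded names and values.
# If a name appears more than once, the last value wins in this sketch.
sub parse_pairs {
    my ($raw) = @_;
    my %form;
    foreach my $pair (split /&/, $raw) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        $form{ url_decode($name) } = url_decode($value);
    }
    return %form;
}

# Hypothetical sample data, made up for this illustration.
my %form = parse_pairs('email=bills&name=Bill+Smith&stuff=100%25+sure');
print "$_: $form{$_}\n" for sort keys %form;
```

The order matters: the data is split on the ampersands and equal signs before anything is decoded, so an encoded & or = inside a value can't be mistaken for a separator.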
When people start a conversation, you normally respond, unless you're ignoring them. With conversations between a user and your CGI program, it's important to make sure that you do something to let the user know the conversation is over, preferably without slamming the connection closed on them. So how do you eloquently end the conversation? It all depends on what you want to say in closing.
There are a lot of good reasons to generate output that gets sent back to the user. Normally, the whole purpose of the application is to obtain that information and then send it along, as with what happens when people use a search engine. In this case, the general idea is that the program has accomplished the mission it was assigned by evaluating the submitted data and coming up with something useful in return, and it's ready to call it a day.
Fortunately, or unfortunately, users can't stop the CGI program once it decides to generate the output and cease and desist; they can only grumble and restart the process all over again. To prevent them from having to do that needlessly, it's important for CGI programmers to make sure that output is carefully thought through so there are no surprises.
Normally, output falls into three classes: successful, not successful, and something else. Successful is the kind of result you get back when a search engine finds some matches to your inquiry and presents them to you in what it hopes is an orderly fashion. Not successful is pretty self-explanatory; it means that something went wrong, and you're not going to get what you're looking for. Something else, well, there are a lot of things that the server can do. It can start sending out a binary file, it can send the wrong type of output, it can even send you to another location entirely. In all cases, though, the output should be controlled by the CGI program because if it's not, that's a bigger problem.
Headers are what CGI programs use to preface data and say "Hey, I'm sending you..." so that the server knows what to do with it. There are three primary types of headers that CGI programs send back:
- Content-type
- Location
- Status
Each of these headers, regardless of type, is followed by a blank line to indicate to the server that this is a header, not the data, and that it's all done telling you about the header information.
You've had more Content-type headers directed at your browser than you can easily count. Every time an HTML file or an image comes in, they're preceded by a Content-type header that the server automatically passes along with each document. The number of possible types of data that can be passed back is pretty high because there are lots of different types of files out there. Most of these types are what's known as MIME (Multipurpose Internet Mail Extensions) types. MIME types are just generic classifications of documents and files that systems use to figure out what to do with them. By default, your HTML browser knows how to deal with the HTML type of content and normally images, as well.
The way these types of content are defined is through a combination of types and subtypes. The following are the seven basic MIME types for content:
- application
- audio
- image
- message
- multipart
- text
- video
Within each one of these types there are different subtypes, just like there are different brands and flavors of ice cream, even though it's the same basic stuff. When a Content-type header gets sent back, it specifies that it's a Content-type header, the MIME type, and the MIME subtype, and hopes that the client on the other end can figure out what the heck to do with it.
MIME types and subtypes crop up all the time, especially with new client plug-ins and other software ideas that people are putting into motion. To be safe, use a standard MIME type and subtype wherever possible; otherwise, you're bound to get weird results.
Some of the more common type/subtype combinations are
- text/html
- text/plain
- image/gif
- image/jpeg
- application/octet-stream
Now, if a CGI program wants to send you back HTML, it's going to have to tell your browser (and the server) that it's definitely sending back something as text/html. The way it would do that in a language like Perl would be as follows:
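# Send the header, then the blank line that ends the header section.
print "Content-type: text/html\n\n";
# Everything after the blank line is the HTML itself.
print "<h1>Hi there</h1>\n";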
All that this does is send the "Content-type: text/html" definition, followed by a blank line (\n means "start a new line" in Perl; two \n symbols are needed to get a new line followed by a blank line), and then the HTML. Pretty easy, huh?
If the CGI program doesn't really want to create a whole new pile of HTML to send back to the user, it can do something else: point them to a different location. That's right, a CGI program can instruct your browser to go to a new location by specifying a Location header. This is how random link programs work: You start the CGI script, the random link program reads a bunch of possible sites from a database, picks one, and sends back a Location header to your browser saying "Go here."
Location headers are even easier to use than Content-type headers. All you have to do is specify where to go. Listing 3.1 shows an example in Perl.
Listing 3.1. Returning a location header in Perl.
print "Location: http://there.com/file.html\n\n";
Again, the header has a blank line under it to show that it's a header, is special, and should be dealt with first. So, if the program doesn't have anything useful to say, it can just send your browser off somewhere else and hope for the best.
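The random link program mentioned earlier needs very little beyond that one line. The following Perl sketch shows the idea, assuming the list of sites is simply written into the script; the subroutine names and the second URL in the list are placeholders for illustration, and a real program might read its list from a file or database instead.

```perl
use strict;
use warnings;

# Build a Location header, including the blank line that ends it.
sub location_header {
    my ($url) = @_;
    return "Location: $url\n\n";
}

# Pick one entry at random from a list of sites.
sub pick_random {
    my (@sites) = @_;
    return $sites[ int(rand(@sites)) ];
}

# Hypothetical list of destinations.
my @sites = (
    'http://there.com/file.html',
    'http://yourplace.com/',
);

print location_header(pick_random(@sites));
```

Since the script sends only a Location header, it never has to produce any HTML of its own.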
If something goes wrong with the CGI program, it has the option
to let you know. Wouldn't that be nice? A Status header
is just an easy way of saying, "Okay, this happened, and
you know what to tell the user." What kinds of status codes
are there, and what do they mean? Table 3.5 takes a look at some
of the common ones.
|200||OK||Request worked just fine; no problems.|
|202||Accepted||The request is still being processed but was accepted.|
|301||Moved Permanently||The document has been moved to a new location.|
|302||Found||The document isn't where it was specified to be, but it's temporarily available at a different location.|
|400||Bad Request||The syntax of the HTTP request wasn't right.|
|401||Unauthorized||The document requires privileges to get.|
|403||Forbidden||Server denied access to the document.|
|404||Not Found||The server couldn't find any such document.|
|500||Internal Server Error||The server ran into big trouble.|
|503||Service Unavailable||The server is too busy to help you.|
Some of these status codes occur when there's a problem in the CGI script. See Chapter 6, "Testing and Debugging," for more details on the possible causes if you're the one designing the CGI program.
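A CGI program sends a Status header the same way it sends the others: as a line of text ahead of the blank line. Here is a hedged sketch in Perl of a script that reports a 400 when the data it expects is missing. The subroutine name error_response and the rule that a "name" value must be present are assumptions invented for this example.

```perl
use strict;
use warnings;

# Build a complete error response: a Status header, a Content-type
# header, the blank line that ends the headers, and a short HTML body.
sub error_response {
    my ($code, $reason, $message) = @_;
    return "Status: $code $reason\n"
         . "Content-type: text/html\n\n"
         . "<h1>$code $reason</h1>\n<p>$message</p>\n";
}

# Hypothetical check: this sketch insists on a "name" value being present.
my $query = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : '';
if ($query =~ /(?:^|&)name=[^&]/) {
    print "Content-type: text/html\n\n<h1>Thanks!</h1>\n";
} else {
    print error_response(400, 'Bad Request', 'The name field is required.');
}
```

When no Status header is sent, as in the successful branch here, servers normally report 200 OK on the program's behalf.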
With all the things you've seen about CGI programs, you may start to wonder just how many uses you could possibly think of for them. Probably quite a lot. The best way to get an idea of what goes into a cool CGI application you can build, and how it works, is to look around. When you find something and don't quite understand how it works, you can often talk to the people who made it and see if they'll tell you how they did it. Plenty of people make that kind of stuff available at no charge to other people because it's how they learned in the first place: people just helping out because they can.
One of the biggest things that keeps people from running right out and programming their own CGI application is that they think it'll be difficult. In some cases, they're right; but in most cases, the amount of difficulty is exaggerated. Sure, the first time out you won't want to try to create a topic relevance-based search engine, or something else requiring a lot of programming knowledge, but the point isn't to frustrate yourself when you're just getting started. Take your time to become familiar with how CGI works and what things you want to do with it, and many things will start to come naturally.
CGI programs are everywhere because they help people do things they couldn't previously do with their Web server. Having seen how servers can be taught to get help when they need it, you're now ready for the higher mysteries of CGI. Remember, though, that no matter how complex CGI may seem at times, it's just a matter of establishing a conversation between the server (who's been contacted by someone who needs something done) and the CGI program (who's going to do something about it). There's always a real, basic reason that something works the way it does and something familiar you can relate it to. You just have to find the connection.