Now that we’ve looked at setup issues, it’s time to get into
concrete programming details. This section is a tutorial that introduces
CGI coding one step at a time—from simple, noninteractive scripts to
larger programs that utilize all the common web page user input devices
(what we called widgets in the tkinter GUI chapters in Part III).
Along the way, we’ll also explore the core ideas behind server-side
scripting. We’ll move slowly at first, to learn all the basics; the next
chapter will use the ideas presented here to build up larger and more
realistic website examples. For now, let’s work through a simple CGI
tutorial, with just enough HTML thrown in to write basic server-side
scripts.
As mentioned, CGI scripts are intimately bound up with HTML, so let’s
start with a simple HTML page. The file tutor0.html, shown in
Example 15-2, defines a bona fide, fully functional web page: a text
file containing HTML code, which specifies the structure and contents
of a simple web page.
Example 15-2. PP4E\Internet\Web\tutor0.html
<HTML>
<TITLE>HTML 101</TITLE>
<BODY>
<H1>A First HTML Page</H1>
<P>Hello, HTML World!</P>
</BODY></HTML>
If you point your favorite web browser to the Internet address of
this file, you should see a page like that shown in Figure 15-2. This
figure shows the Internet Explorer browser at work on the address
http://localhost/tutor0.html (type this into your browser’s address
field), and it assumes that the local web server described in the prior
section is running; other browsers render the page similarly. Since
this is a static HTML file, you’ll get the same result if you simply
click on the file’s icon on most platforms, though its text won’t be
delivered by the web server in this mode.
Figure 15-2. A simple web page from an HTML file
To truly understand how this little file does its work, you need
to know something about HTML syntax, Internet addresses, and file
permission rules. Let’s take a quick first look at each of these topics
before we move on to the next example.
I promised that I wouldn’t teach much HTML in this book, but you need
to know enough to make sense of the examples. In short, HTML is a
descriptive markup language based on tags: items enclosed in <> pairs.
Some tags stand alone (e.g., <HR> specifies a horizontal rule). Others
appear in begin/end pairs in which the end tag includes an extra slash.
For instance, to specify the text of a level-one header line, we
write HTML code of the form <H1>text</H1>; the text between the tags
shows up on the web page. Some tags also allow us to specify options
(sometimes called attributes). For example, a tag pair like
<A href="address">text</A> specifies a hyperlink: pressing the link’s
text in the page directs the browser to access the Internet address
(URL) listed in the href option.
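To make the tag and attribute vocabulary concrete, here is a minimal
sketch that uses Python’s standard html.parser module to list the tags
in a short HTML string. The class name TagLister, the function
list_tags, and the sample string are illustrative assumptions, not
part of the book’s examples.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record start tags (with attributes), end tags, and nonblank text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(('text', data.strip()))

def list_tags(html):
    """Return the parse events for an HTML string (hypothetical helper)."""
    parser = TagLister()
    parser.feed(html)
    parser.close()
    return parser.events

if __name__ == '__main__':
    sample = '<H1>A header</H1><HR><A href="http://www.python.org">Python</A>'
    for event in list_tags(sample):
        print(event)
```

The standalone rule tag produces a start event with no matching end
event, while the header and hyperlink each produce a begin/end pair;
the hyperlink’s href option shows up in its attribute dictionary.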
It’s important to keep in mind that HTML is used only to describe
pages: your web browser reads it and translates its description to a
web page with headers, paragraphs, links, and the like. Notably absent
are both layout information (the browser is responsible for arranging
components on the page) and syntax for programming logic (there are no
if statements, loops, and so on). Also, Python code is nowhere to be
found in Example 15-2; raw HTML is strictly for defining pages, not for
coding programs or specifying all user interface details.
HTML’s lack of user interface control and programmability is
both a strength and a weakness. It’s well suited to describing pages
and simple user interfaces at a high level. The browser, not you,
handles physically laying out the page on your screen. On the other
hand, HTML by itself does not directly support full-blown GUIs and
requires us to introduce CGI scripts (or other technologies such as
RIAs) to websites in order to add dynamic programmability to otherwise
static HTML.
Once you write an HTML file, you need to put it somewhere a web
browser can reference it. If you are using the locally running Python
web server described earlier, this becomes trivial: use a URL of the
form http://localhost/file.html to access web pages, and
http://localhost/cgi-bin/file.py to name CGI scripts. This is implied
by the fact that the web server script by default serves pages and
scripts from the directory in which it is run.
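The mapping from a URL to a file in the server’s directory can be
sketched with the standard library alone. The code below is not the
web server script of Example 15-1; as a stand-in, it assumes Python’s
http.server module, serves a temporary directory on a free port chosen
by the operating system, and fetches one file back over HTTP. The names
QuietHandler and fetch_via_local_server are hypothetical.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path

class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Serve static files without writing a log line per request."""

    def log_message(self, format, *args):
        pass

def fetch_via_local_server(filename, text):
    """Store text in a served directory, then fetch it by URL."""
    with tempfile.TemporaryDirectory() as root:
        Path(root, filename).write_text(text, encoding='utf-8')
        handler = functools.partial(QuietHandler, directory=root)
        # Port 0 asks the operating system for any free port
        server = http.server.HTTPServer(('127.0.0.1', 0), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            port = server.server_address[1]
            url = 'http://127.0.0.1:%d/%s' % (port, filename)
            # An empty ProxyHandler keeps the request on the local machine
            opener = urllib.request.build_opener(
                urllib.request.ProxyHandler({}))
            with opener.open(url) as reply:
                return reply.status, reply.read().decode('utf-8')
        finally:
            server.shutdown()
            server.server_close()

if __name__ == '__main__':
    print(fetch_via_local_server('tutor0.html', '<H1>Hello</H1>'))
```

The URL names only the file; the directory comes from wherever the
server was told to look, which is the same idea as serving pages from
the directory in which the server script is run.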
On other servers, URLs may be more complex. Like all HTML files,
tutor0.html must be stored in a directory on the server machine, from
which the resident web server program allows browsers to fetch pages.
For example, on the server used for the second edition of this book,
the page’s file must be stored in or below the public_html directory of
my personal home directory; that is, somewhere in the directory tree
rooted at /home/lutz/public_html. The complete Unix pathname of this
file on the server is:
/home/lutz/public_html/tutor0.html
This path is different from its PP4E\Internet\Web location in the
book’s examples distribution, as given in the example file listing’s
title. When referencing this file on the client, though, you must
specify its Internet address, sometimes called a URL, instead of a
directory pathname. The following URL was used to load the remote page
from the server:
http://starship.python.net/~lutz/tutor0.html
The remote server maps this URL to the Unix pathname automatically,
in much the same way that http://localhost resolves to the examples
directory containing the web server script for our locally running
server. In general, URL strings like the one just listed are composed
as the concatenation of multiple parts:
Protocol: http
The protocol part of this URL tells the browser to communicate with
the HTTP (i.e., web) server program on the server machine, using the
HTTP message protocol. URLs used in browsers can also name different
protocols: for example, ftp:// to reference a file managed by the FTP
protocol and server, file:// to reference a file on the local machine,
telnet:// to start a Telnet client session, and so on.
Server machine name and port: starship.python.net
A URL also names the target server machine’s domain name or Internet
Protocol (IP) address following the protocol type. Here, we list the
domain name of the server machine where the examples are installed; the
machine name listed is used to open a socket to talk to the server. As
usual, a machine name of localhost (or the equivalent IP address
127.0.0.1) here means the server is running on the same machine as the
client.
Optionally, this part of the URL may also explicitly give the socket
port on which the server is listening for connections, following a
colon (e.g., starship.python.net:8000, or 127.0.0.1:80). For HTTP, the
socket is usually connected to port number 80, so this is the default
if the port is omitted. See Chapter 12 if you need a refresher on
machine names and ports.
File path: ~lutz/tutor0.html
Finally, the URL gives the path to the desired file on the remote
machine. The HTTP web server automatically translates the URL’s file
path to the file’s true pathname: on the starship server, ~lutz is
automatically translated to the public_html directory in my home
directory. When using the Python-coded web server script in
Example 15-1, files are mapped to the server’s current working
directory instead. URLs typically map to such files, but they can
reference other sorts of items as well, and as we’ll see in a few
moments may name an executable CGI script to be run when accessed.
Query parameters
URLs may also be followed by additional input parameters for CGI
programs. When used, they are introduced by a ? and are typically
separated by & characters. For instance, a string of the form
?name=bob&job=hacker at the end of a URL passes parameters named name
and job to the CGI script named earlier in the URL, with values bob and
hacker, respectively. As we’ll discuss later in this chapter when we
explore escaping rules, the parameters may sometimes be separated by ;
characters instead, as in ?name=bob;job=hacker, though this form is
less common.
These values are sometimes called URL query string parameters and are
treated the same as form inputs by scripts. Technically speaking, query
parameters may have other structures (e.g., unnamed values separated by
+), but we will ignore additional options in this text; more on both
parameters and input forms later in this tutorial.
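The parts just described can be pulled out of a URL string with the
standard urllib.parse module. The following is a minimal sketch; the
function name describe_url and the second sample URL (which combines
the port and query string forms mentioned above) are assumptions made
for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Split a URL into the parts discussed in the text."""
    parts = urlsplit(url)
    return {
        'protocol': parts.scheme,
        'server': parts.hostname,
        'port': parts.port,              # None if the URL omits the port
        'path': parts.path,
        'inputs': parse_qs(parts.query)  # each name maps to a list of values
    }

if __name__ == '__main__':
    print(describe_url('http://starship.python.net/~lutz/tutor0.html'))
    print(describe_url(
        'http://localhost:8000/cgi-bin/file.py?name=bob&job=hacker'))
```

Each parameter name maps to a list because a name may appear more than
once in a query string. Recent Python releases split query strings on &
by default, so the less common ; form may need to be requested
explicitly.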
To make sure we have a handle on URL syntax, let’s pick apart
another example that we will be using later in this chapter. In the
following HTTP protocol URL:
http://localhost:80/cgi-bin/languages.py?language=All
the components uniquely identify a server script to be run as
follows:
The server name localhost means the web server is running on the same
machine as the client; as explained earlier, this is the configuration
we’re using for our examples.
Port number 80 gives the socket port on which the web server is
listening for connections (port 80 is the default if this part is
omitted, so we will usually omit it).
The file path cgi-bin/languages.py gives the location of the file to
be run on the server machine, within the directory where the server
looks for referenced files.
The query string ?language=All provides an input parameter to the
referenced script languages.py, as an alternative to user input in form
fields (described later).
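Going the other way, the same components can be assembled into a URL
string. This sketch assumes the standard urllib.parse functions
urlencode and urlunsplit; the helper name build_script_url is
hypothetical.

```python
from urllib.parse import urlencode, urlunsplit

def build_script_url(server, script_path, inputs, port=None):
    """Compose an HTTP URL from server, optional port, path, and inputs."""
    netloc = server if port is None else '%s:%d' % (server, port)
    query = urlencode(inputs)            # name=value pairs joined by &
    # The empty last item is the fragment part, unused here
    return urlunsplit(('http', netloc, script_path, query, ''))

if __name__ == '__main__':
    print(build_script_url('localhost', '/cgi-bin/languages.py',
                           {'language': 'All'}, port=80))
    print(build_script_url('localhost', '/cgi-bin/languages.py',
                           {'language': 'All'}))
```

The second call omits the port, which yields the shorter form that a
browser treats as equivalent for HTTP.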
Although this covers most URLs you’re likely to encounter in the
wild, the full format of URLs is slightly richer:
protocol://networklocation/path;parameters?querystring#fragment
For instance, the fragment part may name a section within a page
(e.g., #part1). Moreover, each part can have formats of its own, and
some are not used in all protocols. The ;parameters part is omitted for
HTTP, for instance (it gives an explicit file type for FTP), and the
networklocation part may also specify optional user login parameters
for some protocol schemes (its full format is user:password@host:port
for FTP and Telnet, but just host:port for HTTP). We used a complex FTP
URL in Chapter 13, for example, which included a username and password,
as well as a binary file type (the server may guess if no type is
given):
ftp://lutz:[email protected]/filename;type=i
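The richer format can be explored with urllib.parse.urlparse, which
also separates the ;parameters part and the login fields. In this
sketch the user name, password, and host ftp.example.com are made-up
placeholders, and full_parts is a hypothetical helper name.

```python
from urllib.parse import urlparse

def full_parts(url):
    """Return every part of the full URL format as a dictionary."""
    p = urlparse(url)
    return {
        'protocol': p.scheme,
        'user': p.username,
        'password': p.password,
        'host': p.hostname,
        'port': p.port,
        'path': p.path,
        'parameters': p.params,
        'query': p.query,
        'fragment': p.fragment,
    }

if __name__ == '__main__':
    # Placeholder login and host, for illustration only
    print(full_parts('ftp://someuser:secret@ftp.example.com/filename;type=i'))
    print(full_parts('http://localhost:8000/tutor0.html#part1'))
```

Parts that a URL omits come back as empty strings, or as None for the
user, password, and port fields.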
We’ll ignore additional URL formatting rules here. If you’re
interested in more details, you might start by reading the urllib.parse
module’s entry in Python’s library manual, as well as its source code
in the Python standard library. You may also notice that a URL you type
to access a page looks a bit different after the page is fetched
(spaces become + characters, % characters are added, and so on). This
is simply because browsers must also generally follow URL escaping
(i.e., translation) conventions, which we’ll explore later in this
chapter.
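As a small preview of those escaping conventions, the sketch below uses
the standard urllib.parse functions quote_plus and unquote_plus. The
helper name add_input and the sample values are assumptions for
illustration.

```python
from urllib.parse import quote_plus, unquote_plus

def add_input(url, name, value):
    """Append one escaped name=value input parameter to a URL."""
    return '%s?%s=%s' % (url, quote_plus(name), quote_plus(value))

if __name__ == '__main__':
    print(quote_plus('bob smith'))       # the space becomes +
    print(quote_plus('C++'))             # each + becomes a % escape
    print(add_input('http://localhost/cgi-bin/languages.py',
                    'language', 'C++'))
    print(unquote_plus('bob+smith%21'))  # undoes both translations
```

Escaping matters because characters such as +, &, and = have special
meaning inside a query string, so literal occurrences in values must be
translated before they are sent.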
Because browsers remember the prior page’s Internet address, URLs
embedded in HTML files can often omit the protocol and server names, as
well as the file’s directory path. If missing, the browser simply uses
these components’ values from the last page’s address. This minimal
syntax works for URLs embedded in hyperlinks and for form actions
(we’ll meet forms later in this tutorial). For example, within a page
that was fetched from the directory dirpath on the server
http://www.server.com, minimal hyperlinks and form
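actions such as:
<A HREF="more.html">
<FORM ACTION="next.py" ...>
are treated exactly as if we had specified a complete URL with
explicit server and path components, like the following:
<A HREF="http://www.server.com/dirpath/more.html">
<FORM ACTION="http://www.server.com/dirpath/next.py" ...>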
The first minimal URL refers to the file more.html on the same server
and in the same directory from which the page containing this hyperlink
was fetched; it is expanded to a complete URL within the browser. URLs
can also employ Unix-style relative path syntax in the file path
component. A hyperlink tag like <A HREF="../spam.gif">, for instance,
names a GIF file on the same server machine, in the parent directory of
the one holding the file that contains this link’s URL.
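The expansion a browser performs on minimal URLs can be approximated
with urllib.parse.urljoin. This is a sketch of the idea rather than a
description of any particular browser; the page name page.html and the
helper name expand are assumptions.

```python
from urllib.parse import urljoin

def expand(page_url, minimal_url):
    """Resolve a minimal URL against the address of the page holding it."""
    return urljoin(page_url, minimal_url)

if __name__ == '__main__':
    # Hypothetical page fetched from directory dirpath on www.server.com
    page = 'http://www.server.com/dirpath/page.html'
    print(expand(page, 'more.html'))     # same server, same directory
    print(expand(page, '../spam.gif'))   # same server, parent directory
```

A URL that already names its protocol and server is returned
unchanged, since there is nothing left to fill in from the page’s own
address.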
Why all the fuss about shorter URLs? Besides extending the life
of your keyboard and eyesight, the main advantage of such minimal URLs
is that they don’t need to be changed if you ever move your pages to a
new directory or server—the server and path are inferred when the page
is used; they are not hardcoded into its HTML. The flipside of this
can be fairly painful: examples that do include explicit site names
and pathnames in URLs embedded within HTML code cannot be copied to
other servers without source code changes. Scripts and special HTML
tags can help here, but editing source code can be error-prone.
The downside of minimal URLs is that they don’t trigger
automatic Internet connections when followed offline. This becomes
apparent only when you load pages from local files on your computer.
For example, we can generally open HTML pages without connecting to
the Internet at all by pointing a web browser to a page’s file that
lives on the local machine (e.g., by clicking on its file icon). When
browsing a page locally like this, following a fully specified URL
makes the browser automatically connect to the Internet to fetch the
referenced page or script. Minimal URLs, though, are opened on the
local machine again; usually, the browser simply displays the
referenced page or script’s source code.
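The offline behavior falls out of the same resolution rule: when the
containing page was opened from a local file, a minimal URL inherits
the file protocol. The sketch below illustrates this with
urllib.parse; the local path /home/me/pages/index.html and the helper
name follow_link are hypothetical.

```python
from urllib.parse import urljoin, urlsplit

def follow_link(page_url, href):
    """Return the link's target URL and whether it is a local file."""
    target = urljoin(page_url, href)
    return target, urlsplit(target).scheme == 'file'

if __name__ == '__main__':
    # Hypothetical page opened by clicking its file icon
    local_page = 'file:///home/me/pages/index.html'
    print(follow_link(local_page, 'more.html'))
    print(follow_link(local_page, 'http://localhost/more.html'))
```

The first link stays on the local machine, while the fully specified
one names a server to contact regardless of how the page was opened.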
The net effect is that minimal URLs are more portable, but they
tend to work better when running all pages live on the Internet (or
served up by a locally running web server). To make them easier to
work with, the examples in this book will often omit the server and
path components in URLs they contain. In this book, to derive a page
or script’s true URL from a minimal URL, imagine that the
string:
http://localhost/
appears before the filename given by the URL. Your browser will,
even if you don’t.