HTML The Definitive Guide

Authors: Chuck Musciano Bill Kennedy

Name                         Usage    Encoding   Character
Less than sign               Unsafe   %3C        <
Greater than sign            Unsafe   %3E        >
Double quotation mark        Unsafe   %22        "
Hash symbol                  Unsafe   %23        #
Percent                      Unsafe   %25        %
Left curly brace             Unsafe   %7B        {
Right curly brace            Unsafe   %7D        }
Vertical bar                 Unsafe   %7C        |
Backslash                    Unsafe   %5C        \
Caret                        Unsafe   %5E        ^
Tilde                        Unsafe   %7E        ~
Left square bracket          Unsafe   %5B        [
Right square bracket         Unsafe   %5D        ]
Back single quotation mark   Unsafe   %60        `

In general, you should always encode a character if there is some doubt as to whether it can be placed as-is in a URL. As a rule of thumb, any character other than a letter, number, or any of the characters $-_.+!*'(), should be encoded.

It is never an error to encode a character, unless that character has a specific meaning in the URL. For example, encoding the slashes in an http URL will cause them to be used as regular characters, not as pathname delimiters, breaking the URL.
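The rule of thumb and the warning about slashes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names encode_component and encode_path are our own, and the set of safe characters is copied from the rule of thumb above rather than taken from any particular library routine.

```python
import string

# Characters that may appear as-is, per the rule of thumb above:
# letters, digits, and $-_.+!*'(),
SAFE_CHARACTERS = set(string.ascii_letters + string.digits + "$-_.+!*'(),")


def encode_component(text):
    """Percent-encode every byte of text that is not in SAFE_CHARACTERS."""
    encoded = []
    for byte in text.encode("utf-8"):
        character = chr(byte)
        if character in SAFE_CHARACTERS:
            encoded.append(character)
        else:
            # Two uppercase hexadecimal digits, as in the table (%3C, %7B, ...).
            encoded.append("%{:02X}".format(byte))
    return "".join(encoded)


def encode_path(path):
    """Encode each name in a path but leave the slash delimiters alone."""
    return "/".join(encode_component(name) for name in path.split("/"))


print(encode_component("<kumquat>"))            # %3Ckumquat%3E
print(encode_path("planting/soil{prep}.html"))  # planting/soil%7Bprep%7D.html
print(encode_component("planting/guide.html"))  # planting%2Fguide.html
```

The last call shows the failure described above: once the slash is encoded as %2F, it no longer separates the directory from the document name.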

7.2.2 The http URL

The http URL is by far the most common within the World Wide Web. It is used to access documents stored on an http server, and it has two formats:

http://server:port/path#fragment
http://server:port/path?search

Some of the parts are optional. In fact, the most common form of the http URL is simply:

http://server/path

designating the unique server and the directory path and name of a document.
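As a rough illustration of how these parts line up, the sketch below splits a few http URLs with Python's standard urllib.parse module. The helper name describe_http_url and the dictionary keys are our own choices, and the first URL is an invented combination of port, path, and fragment used only to exercise every field.

```python
from urllib.parse import urlsplit


def describe_http_url(url):
    """Return the server, port, path, search, and fragment parts of a URL."""
    parts = urlsplit(url)
    return {
        "server": parts.hostname,
        "port": parts.port,          # None when the URL omits the port
        "path": parts.path,
        "search": parts.query,       # empty string when there is no ?search
        "fragment": parts.fragment,  # empty string when there is no #fragment
    }


for url in (
    "http://www.kumquat.com:8080/planting/guide.html#soil_prep",
    "http://www.kumquat.com/find_a_quat?state=Florida",
    "http://www.oreilly.com/catalog.html",
):
    print(describe_http_url(url))
```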

7.2.2.1 The http server

The server is the unique Internet name or Internet Protocol (IP) numerical address of the computer system that stores the web resource. Like us, we suspect you'll mostly use more easily remembered Internet names for the servers in your URLs.[3] The name consists of several parts, including the server's actual name and the successive names of its network domain, each part separated by a period.

Typical Internet names look like www.oreilly.com or hoohoo.ncsa.uiuc.edu.[4]

[3] Each Internet-connected computer has a unique address: a numeric (IP) address, of course, because computers deal only in numbers. Humans prefer names, so the Internet folks provide us with a collection of special servers and software (Domain Name System or DNS) that automatically resolve Internet names into IP addresses. InterNIC, a nonprofit agency, registers domain names mostly on a first-come, first-served basis, and distributes new names to DNS servers worldwide.

[4] In the United States and for some Canadian establishments, the three-letter suffix of the domain name identifies the type of organization or business that operates that portion of the Internet. For instance, "com" is a commercial enterprise; "edu" is an academic institution; and "gov" identifies a government-based domain. Outside the United States, a less-descriptive suffix is assigned; typically a two-letter abbreviation of the country name: "jp" for Japan and "de" for Deutschland, for instance. That convention indicates the traditional distribution of the Internet and presumably will change dramatically as the network proliferates in the rest of the world.

It has become something of a convention that webmasters name their servers www for quick and easy identification on the Web. For instance, O'Reilly & Associates' web server's name is www, which, along with the publisher's domain name, becomes the very easily remembered web site www.oreilly.com. Similarly, Sun Microsystems's web server is named www.sun.com; Apple Computer's is www.apple.com, and even Microsoft makes their web server easily memorable as www.microsoft.com. The naming convention has very obvious benefits, which you, too, should take advantage of if you are called upon to create a web server for your organization.

You may also specify the address of a server using its numerical IP address. The address is a sequence of four numbers, zero to 255, separated by periods. Valid IP addresses look like 137.237.1.87 or 192.249.1.33.
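The format just described (four numbers, zero to 255, separated by periods) is simple enough to check by hand. Here is a minimal sketch; the function name looks_like_ip_address is ours, and the check covers only the dotted form described in this section.

```python
def looks_like_ip_address(text):
    """Return True if text is four decimal numbers, 0 to 255, joined by periods."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # isdigit() alone accepts some non-ASCII digits, so require ASCII too.
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True


print(looks_like_ip_address("137.237.1.87"))     # True
print(looks_like_ip_address("256.1.1.1"))        # False
print(looks_like_ip_address("www.oreilly.com"))  # False
```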

It'd be a dull diversion to tell you now what the numbers mean or how to derive an IP address from a domain name, particularly since you'll rarely if ever use one in a URL. Rather, this is a good place to hyperlink: pick up any good Internet networking treatise for rigorous detail on IP addressing, such as Ed Krol's The Whole Internet User's Guide and Catalog (O'Reilly & Associates).

7.2.2.2 The http port

The port is the number of the communication port through which the client browser connects to the server.

It's a networking thing: servers do many things besides serve up web documents and resources to client browsers, including electronic mail, FTP document fetches, filesystem sharing, and so on. Although all that network activity may come into the server on a single wire, it's typically divided into software-managed "ports" for service-specific communications, something analogous to boxes at your local post office.

The default URL port for web servers is 80. Special secure web servers (Secure HTTP, SHTTP or Secure Socket Layer, SSL) run on port 443. Most web servers today use port 80; you need to include a port number along with an immediately preceding colon in your URL if the target server does not use port 80 for web communication.
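The rule for when to write the port can be sketched as follows. The function name build_http_url and its parameters are our own illustration; the only fact taken from the text is that 80 is the default and any other port must appear after a colon.

```python
DEFAULT_HTTP_PORT = 80


def build_http_url(server, path="/", port=DEFAULT_HTTP_PORT):
    """Assemble an http URL, writing the port only when it is not the default."""
    if not path.startswith("/"):
        path = "/" + path
    if port == DEFAULT_HTTP_PORT:
        return "http://{}{}".format(server, path)
    return "http://{}:{}{}".format(server, port, path)


print(build_http_url("www.oreilly.com", "/catalog.html"))  # http://www.oreilly.com/catalog.html
print(build_http_url("www.kumquat.com", "/", 8080))        # http://www.kumquat.com:8080/
```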

When the Web was in its infancy many months ago, pioneer webmasters ran their Wild Wild Web connections on all sorts of port numbers. For technical and security reasons, system-administrator privileges are required to install a server on port 80. Lacking such privileges, these webmasters chose other, more easily accessible, port numbers.

Now that web servers have become acceptable and are under the care and feeding of responsible administrators, documents being served on some port other than 80 or 443 should make you wonder if that server is really on the up and up. Most likely, the maverick server is being run by a clever user unbeknownst to the server's bona fide system administrators.

7.2.2.3 The http path

The document path is the Unix-style hierarchical location of the file in the server's storage system.

The pathname consists of one or more names separated by slashes. All but the last name represent directories leading down to the document; the last name is usually that of the document itself.

It has become a convention that for easy identification, HTML document names end with the suffix .html (they're otherwise plain ASCII text files, remember?). Although Windows 95 and Windows NT allow longer suffixes, their users usually stick to the three-letter .htm name suffix for HTML documents.

Although the server name in a URL is not case-sensitive, the document pathname may be. Since most web servers are run on Unix-based systems and Unix file names are case-sensitive, the document pathname will be case-sensitive, too. Web servers running on Windows machines are not case-sensitive, so the document pathname is not, but since it is impossible to know the operating system of the server you are accessing, always assume that the server has case-sensitive pathnames and take care to get the case correct when typing your URLs.
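One way to picture the advice is a comparison that ignores case in the server name but not in the pathname. The sketch below is an assumption-laden illustration: the name same_document is ours, and it deliberately takes the cautious view recommended above by treating every pathname as case-sensitive.

```python
from urllib.parse import urlsplit


def same_document(url_a, url_b):
    """Compare two http URLs: server names ignore case, pathnames do not."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    same_server = (a.hostname or "").lower() == (b.hostname or "").lower()
    same_port = (a.port or 80) == (b.port or 80)
    same_path = a.path == b.path  # assume a case-sensitive server
    return same_server and same_port and same_path


print(same_document("http://WWW.OReilly.com/catalog.html",
                    "http://www.oreilly.com/catalog.html"))  # True
print(same_document("http://www.oreilly.com/Catalog.html",
                    "http://www.oreilly.com/catalog.html"))  # False
```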

Certain conventions regarding the document pathname have arisen. If the last element of the document path is a directory, not a single document, the server usually will send back either a listing of the directory contents or the HTML index document in that directory. You should end the document name for a directory with a trailing slash character, but in practice, most servers will honor the request even if the character is omitted.

If the directory name is just a slash alone or sometimes nothing at all, you will retrieve the first (top-level) HTML document or so-called home page in the uppermost root directory of the server.

Every well-designed http server should have an attractive, well-designed "home page"; it's a shorthand way for users to access your web collection since they don't need to remember the document's actual filename, just your server's name. That's why, for example, you can type http://www.oreilly.com into Netscape's "Open" dialog box and get O'Reilly's home page.

Another twist: if the first component of the document path starts with the tilde character (~), it means that the rest of the pathname begins from the personal HTML directory in the home directory of the specified user on the server machine. For instance, the URL http://www.kumquat.com/~chuck/ would retrieve the top-level page from Chuck's document collection.

Different servers have different ways of locating documents within a user's home directory. Many search for the documents in a directory named public_html. Unix-based servers are fond of the name index.html for home pages.
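To make these conventions concrete, here is a minimal sketch of how a server might map a URL path onto a file. Everything about the layout is an assumption for illustration: the document root /var/www and the home directory location /home are hypothetical, real servers make all of this configurable, and a real server would also refuse ".." components and consult the filesystem to recognize a directory named without its trailing slash.

```python
import posixpath

DOCUMENT_ROOT = "/var/www"      # hypothetical server document root
HOME_ROOT = "/home"             # hypothetical location of user home directories
USER_DIRECTORY = "public_html"  # per-user directory name mentioned in the text
INDEX_DOCUMENT = "index.html"   # home page name mentioned in the text


def resolve_path(url_path):
    """Map the path part of an http URL to a file name on the server."""
    names = [name for name in url_path.split("/") if name]
    if names and names[0].startswith("~") and len(names[0]) > 1:
        # ~user: continue from that user's personal HTML directory.
        base = posixpath.join(HOME_ROOT, names[0][1:], USER_DIRECTORY)
        names = names[1:]
    else:
        base = DOCUMENT_ROOT
    target = posixpath.join(base, *names)
    if url_path == "" or url_path.endswith("/"):
        # A directory was requested: fall back to its index document.
        target = posixpath.join(target, INDEX_DOCUMENT)
    return target


print(resolve_path("/"))                     # /var/www/index.html
print(resolve_path("/planting/guide.html"))  # /var/www/planting/guide.html
print(resolve_path("/~chuck/"))              # /home/chuck/public_html/index.html
```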

7.2.2.4 The http document fragment

The fragment is an identifier that points to a specific section of a document. In URL specifications, it follows the server and pathname and is separated by the pound sign (#). A fragment identifier indicates to the browser that it should begin displaying the target document at the indicated fragment name. As we describe in more detail later in this chapter, you insert fragment names into a document with the <a> tag and the name attribute. Like pathnames, a fragment name may be any sequence of characters.

The fragment name and the preceding hash symbol are optional; omit them when referencing a document without defined fragments.

Formally, the fragment element only applies to target files that are HTML documents. If the target of the URL is some other document type, the fragment name may be misinterpreted by the browser.

Fragments are useful for long documents. By identifying key sections of your document with a fragment name, you make it easy for readers to link directly to that portion of the document, avoiding the tedium of scrolling or searching through the document to get to the section that interests them.

As a rule of thumb, we recommend that every section header in your documents be accompanied by an equivalent fragment name. By consistently following this rule, you'll make it possible for readers to jump to any section in any of your documents. Fragments also make it easier to build tables of contents for your document families.
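If your documents are generated by a script, this rule is easy to follow automatically. The sketch below is one possible approach, not a prescribed one: the function names and the lowercase, underscore-separated naming scheme are our own, and we assume fragment names are defined with the <a> tag's name attribute.

```python
import re
from html import escape


def fragment_name(heading):
    """Turn a section heading into a lowercase, underscore-separated name."""
    words = re.findall(r"[A-Za-z0-9]+", heading)
    return "_".join(word.lower() for word in words)


def heading_with_anchor(heading, level=2):
    """Emit a section header that carries an equivalent fragment name."""
    return '<h{0}><a name="{1}">{2}</a></h{0}>'.format(
        level, fragment_name(heading), escape(heading))


def contents_entry(heading):
    """Emit a table-of-contents link that jumps to the same fragment."""
    return '<a href="#{0}">{1}</a>'.format(fragment_name(heading), escape(heading))


print(heading_with_anchor("Soil Prep"))  # <h2><a name="soil_prep">Soil Prep</a></h2>
print(contents_entry("Soil Prep"))       # <a href="#soil_prep">Soil Prep</a>
```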

7.2.2.5 The http search parameter

The search component of the http URL, along with its preceding question mark, is optional. It indicates that the path is a searchable or executable resource on the server. The content of the search component is passed to the server as parameters that control the search or execution function.

The actual encoding of parameters in the search component is dependent upon the server and the resource being referenced. The parameters for searchable resources are covered later in this chapter, when we discuss searchable documents. Parameters for executable resources are discussed in Chapter 10, Forms.
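Since the encoding depends on the server, the following is only a sketch of one widespread convention, name=value pairs, built with Python's standard urllib.parse module. The helper name add_search and the kumquat.com resource are illustrative; a particular server may expect something different.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def add_search(url, parameters):
    """Append a search component made of name=value pairs to a URL."""
    return url + "?" + urlencode(parameters)


url = add_search("http://www.kumquat.com/find_a_quat", {"state": "Florida"})
print(url)                            # http://www.kumquat.com/find_a_quat?state=Florida
print(parse_qs(urlsplit(url).query))  # {'state': ['Florida']}
```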

Although our initial presentation of http URLs indicated that a URL can have either a fragment identifier or a search component, some browsers let you use both in a single URL. If you so desire, you can follow the search parameter with a fragment identifier, telling the browser to begin displaying the results of the search at the indicated fragment. Netscape, for example, supports this usage.

We don't recommend this kind of URL, though. First and foremost, it doesn't work on a lot of browsers. Just as important, using a fragment implies that you are sure that the results of the search will have a fragment of that name defined within the document. For large document collections, this is hardly likely. You are better off omitting the fragment, showing the search results from the beginning of the document, and avoiding potential confusion among your readers.
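If you generate URLs programmatically, that recommendation can be enforced with a small filter. This is a sketch only; the function name drop_fragment_from_search and the fragment name results are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_fragment_from_search(url):
    """Remove the fragment from a URL that also has a search component."""
    parts = urlsplit(url)
    if parts.query and parts.fragment:
        parts = parts._replace(fragment="")
    return urlunsplit(parts)


print(drop_fragment_from_search(
    "http://www.kumquat.com/find_a_quat?state=Florida#results"))
# http://www.kumquat.com/find_a_quat?state=Florida
```

URLs that carry only a fragment, or only a search component, pass through unchanged.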

7.2.2.6 Sample http URLs

Here are some sample http URLs:

http://www.oreilly.com/catalog.html

http://www.oreilly.com/

http://www.kumquat.com:8080/

http://www.kumquat.com/planting/guide.html#soil_prep

http://www.kumquat.com/find_a_quat?state=Florida

The first example is an explicit reference to a bona fide HTML document named catalog.html that is stored in the root directory of the www.oreilly.com server. The second references the top-level home page on that same server. That home page may or may not be catalog.html. Sample three, too, assumes that there is a home page in the root directory of the www.kumquat.com server, and that the web connection is to the nonstandard port 8080.

The fourth example is the URL for retrieving the web document named guide.html from the planting directory on the www.kumquat.com server. Once retrieved, the browser should display the document beginning at the fragment named soil_prep.

The last example invokes an executable resource named find_a_quat with the parameter named state set to the value Florida. Presumably, this resource generates an HTML response that is subsequently displayed by the browser.

7.2.3 The javascript URL

The javascript URL actually is a pseudo-protocol, not usually included in discussions of URLs. Yet, with advanced browsers like Netscape and Internet Explorer, the javascript URL can be associated with a hyperlink and used to execute JavaScript commands when the user selects the link. Expect to see many examples of link-related JavaScript effects in HTML documents on the Web. [JavaScript URLs, 13.3.4]

7.2.3.1 The javascript URL arguments
