XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition (672 page)

BOOK: XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition
11.24Mb size Format: txt, pdf, ePub

bold and emphatic statement

An example of a well-formed document that is not a well-formed external general parsed entity (because it contains a
standalone
attribute) is:


 and emphatic statement


The rules for document entities and external general parsed entities overlap, as shown in
Figure 15-1
.

Essentially, an XSLT stylesheet can output anything that fits in either of the two shaded circles, which means anything that is a well-formed XML document entity, a well-formed external general parsed entity, or both.

Well, almost anything:

  • It must also conform to the XML Namespaces Recommendation.
  • There is no explicit provision for generating an internal DTD subset, although it can be achieved, with difficulty, by using character maps.
  • Similarly, there is no explicit provision for generating entity references, though this can also be achieved by means of character maps.

In the XML standard, the rules for an external parsed entity are given as:

extParsedEnt ⇒ TextDecl? content

where
content
is a sequence of components including child elements, character data, entity references, CDATA sections, processing instructions, and comments, each of which may appear any number of times and in any order.

The corresponding rule for a document entity is effectively:

document ⇒ XMLDecl? Misc* doctypedecl? Misc* element Misc*

where
Misc
permits whitespace, comments, and processing instructions.

So the principal differences between the two cases are:

  • A
    TextDecl
    (text declaration) is not quite the same thing as an
    XMLDecl
    (XML declaration), as discussed below.
  • A document may contain a
    doctypedecl
    (document type declaration), but an external parsed entity must not. A document type declaration is
    the

    header identifying the DTD and possibly including an internal DTD subset.
  • The body of a document is an
    element
    , while the body of an external parsed entity is
    content
    . Here
    content
    is effectively the contents of an element but without the start and end tags.
  • In a document, any whitespace that immediately follows the XML declaration is insignificant. In an external general parsed entity, however, such whitespace is significant. This means that the serializer cannot add whitespace here unless it is explicitly requested.

The
TextDecl
(text declaration) looks at first sight very much like an XML declaration; for example,

could be used either as an XML declaration or as a text declaration. There are differences, however:

  • In an XML declaration, the
    version
    attribute is mandatory, but in a text declaration it is optional.
  • In an XML declaration, the
    encoding
    attribute is optional, but in a text declaration it is mandatory.
  • An XML declaration may include a
    standalone
    attribute, but a text declaration must not.

So the following are all examples of well-formed external general parsed entities:

1: Hello!

2: Hello!Goodbye!

3: Hello!

4: Hello!

The following is a well-formed XML document, but it is
not
a well-formed external general parsed entity, because of both the
standalone
attribute and the document type declaration. This is also legitimate output:



Hello!

The following is neither a well-formed XML document nor a well-formed external general parsed entity.



Hello!

Goodbye!

It cannot be an XML document because it has more than one top-level element, and it cannot be an external general parsed entity because it has a

declaration. A stylesheet attempting to produce this output is in error. The XSLT specification also places two other constraints on the form of the output, although these are rules for the implementor to follow rather than rules that directly affect the stylesheet author. These rules are:

  • The output must conform to the rules of the XML Namespaces Recommendation. If the output is XML 1.0, then it must conform to XML Namespaces 1.0, and if it is XML 1.1, then it must conform to XML Namespaces 1.1.

    If the output is an XML document, the meaning of this is clear enough, but if it is merely an external entity, some further explanation is needed. The standard provides this by saying that when the entity is pulled into a document by adding an element tag around its content, the resulting document must conform with the XML Namespaces rules.

  • The output file must faithfully reflect the result tree. This requirement is easy to state informally, but the specification includes a more formal statement of the requirement, which is surprisingly complex.

    The rule is expressed by describing what may change when the data model is serialized to XML and then parsed again to create a new data model. Things that may change include the order of attributes and namespace nodes and the base URI. If the parsing stage does DTD or schema validation, then this may cause new attribute or element values to appear, as specified by the DTD or schema. Perhaps the most significant change is that type annotations on element and attribute nodes will not be preserved. Any type annotations in the data model after such a round trip will be based on revalidation of the textual XML document; the type annotations in the original result tree are lost during serialization.

Serialization never adds attributes such as
xml:base
or
xsi:type
. If you want these present in the output, you must put them in the result tree, just like any other attribute.

Between XSLT 1.0 and XSLT 2.0, there is a change in the way the rules concerning namespace declarations are described. In XSLT 1.0, it was the job of the serializer to generate namespace declarations, not only for namespace nodes explicitly present on the result tree, but also for any namespaces used in the result document, but not represented by namespace nodes. This situation can happen because there is nothing in the rules for

and

, for example, that requires namespace nodes to be created for the namespace URI used in the names of these nodes. In XSLT 2.0, however, the specification describes a process called namespace fixup, which ensures that an element in a result tree always has a namespace node for every namespace that is used either in the element name or in the name of any of its attributes. This means that it is no longer the responsibility of the serializer to create these namespace declarations. The reason for this change is that with XSLT 2.0, the content of temporary trees (including their namespace nodes) becomes visible to the stylesheet, and the namespace nodes need to be present in the tree to make it usable for further processing. Namespace fixup is described in Chapter 6 under

on page 306.

Although the output is required to be well-formed XML, it is not the job of the serializer to ensure that the XML is valid against either a DTD or a schema. Just because you generate a document type declaration that refers to a specific DTD, or a reference to a schema, don't expect the XSLT processor to check that the output document actually conforms to that DTD or schema. Instead, XSLT 2.0 provides facilities to validate the result tree against a schema before it is serialized, by using the
validation
or
type
attribute on

.

With the
xml
output method, the other attributes of

or

are interpreted as follows. Attributes that are not applicable to this output method are not included in the table, and they are ignored if you specify them.

The W3C serialization specification speaks of
doctype-system
,
doctype-public
and so on as serialization parameters. This is because the specification is designed to be used independently of XSLT, and therefore abstracts away from the actual XSLT syntax used. In this book we're concerned with how to control serialization from XSLT, so we'll refer to these things as attributes, which is how they appear in XSLT's concrete syntax.

Attribute
Interpretation
cdata-section-elements
This is a list of element names, each expressed as a lexical QName, separated by whitespace. Any prefix in a QName is treated as a reference to the corresponding namespace URI in the normal way, using the namespace declarations in effect on the actual

or

element where the
cdata-section-elements
attribute appears. Because these are element names, the default namespace is assumed where the name has no prefix. When a text node is output, if the parent element of the text node is identified by a name in this list, then the text node is output as a
CDATA
section. For example, the text value
James
is output as

, and the text value
AT&T
is output as

. Otherwise, this value would probably be output as
AT&T
. The XSLT processor is free to choose other equivalent representations if it wishes, for example a character reference, but the standard says that it should not use CDATA unless it is explicitly requested.
The CDATA section may be split into parts if necessary, perhaps when the terminator sequence
]]>
appears in the data, or when there is a character that can only be output using a character reference because it is not supported directly in the chosen encoding.
doctype-system
If this attribute is specified, the output file will include a document type declaration (that is,

) after the XML declaration and before the first element start tag. The name of the document type will be the same as the name of the first element. The value of this attribute will be used as the system identifier in the document type declaration. This attribute should not be used unless the output is a well-formed XML document.
doctype-public
This attribute is ignored unless the
doctype-system
attribute is also specified. It defines the value of the public identifier to go in the document type declaration. If no public identifier is specified, none is included in the document type declaration.
encoding
This specifies the preferred character encoding for the output document. All XSLT processors are required to support the values
UTF-8
and
UTF-16
(which are also the only values that XML parsers are required to support). This encoding name will be used in the encoding attribute of the XML or Text declaration at the start of the output file, and all characters in the file will be encoded using these conventions. The standard encoding names are not case-sensitive.
If the encoding is one that does not allow all XML characters to be represented directly, for example
iso-8859-1
, then characters outside this subset will be represented where possible using XML character references (such as

). It is an error if such characters appear in contexts where character references are not recognized (for example within a processing instruction or comment, or in an element or attribute name).
If the result tree is serialized to a destination that expects a stream of Unicode characters rather than a stream of bytes, then the
encoding
attribute is ignored. This happens, for example, if you send the output to a Java
Writer
. It often causes confusion when you use the
transformNode()
method in Microsoft's MSXML API, which returns the serialized result of the transformation as a string: This is a value of type
BSTR
, which is encoded in UTF-16 regardless of the encoding you requested in the stylesheet.

Other books

Reversing Over Liberace by Jane Lovering
Bad Nerd Falling by Grady, D.R.
A Drop of Red by Chris Marie Green
Archer's Sin by Amy Raby
The Gypsy Morph by Terry Brooks
Bear by Ellen Miles
El dragón de hielo by George R. R. Martin
The Laird's Daughter by Temple Hogan