******************************************************************************
Extensions of the XML specification
******************************************************************************
==============================================================================
This document
==============================================================================
This parser has some options that extend the XML specification. This document
explains these options.
==============================================================================
Optional declarations instead of mandatory declarations
==============================================================================
The XML spec demands that elements, notations, and attributes be declared.
However, there are situations where a different rule would be better: If there
is a declaration, the actual instance of the element type, notation reference,
or attribute must match the declaration; but if the declaration is missing, a
reasonable default declaration should be assumed.
A typical example is the inclusion of HTML in a meta language. Imagine you
have defined some type of "generator" or other tool working with HTML
fragments, and your document contains two types of elements: the generating
elements (with a name like "gen:xxx"), and the object elements, which are HTML.
As HTML is still evolving, you do not want to declare the HTML elements; the
HTML fragments should be treated as well-formed XML fragments. In contrast, the
elements of the generator should be declared and validated, because this makes
errors easier to detect.
The following two processing instructions can be included in the DTD:
- <?pxp:dtd optional-element-and-notation-declarations?>
  References to unknown element types and notations no longer cause an error.
  The element may contain everything, but it must still be well-formed. It
  may have arbitrary attributes, and every attribute is treated as an
  #IMPLIED CDATA attribute.
- <?pxp:dtd optional-attribute-declarations elements="x y ..."?>
  References to unknown attributes inside one of the enumerated elements no
  longer cause an error. Such an attribute is treated as an #IMPLIED CDATA
  attribute.
If there are several "optional-attribute-declarations" PIs, they are all
interpreted (implicitly merged).
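As an illustration of the generator scenario, the following sketch combines
both processing instructions in an internal DTD subset. The element and
attribute names ("gen:page", "gen:include", "file", "class") are hypothetical,
the PI spellings are those recalled from the PXP documentation and should be
checked against your PXP version, and namespace processing is assumed to be
switched off so that "gen:page" is treated as a plain name:

```xml
<!DOCTYPE gen:page [
  <!-- Hypothetical generator elements: declared, hence validated -->
  <!ELEMENT gen:page ANY>
  <!ELEMENT gen:include EMPTY>
  <!ATTLIST gen:include file CDATA #REQUIRED>

  <!-- Undeclared element types and notations (here: the HTML elements)
       are accepted as long as they are well-formed -->
  <?pxp:dtd optional-element-and-notation-declarations?>

  <!-- Undeclared attributes are accepted, but only on gen:page -->
  <?pxp:dtd optional-attribute-declarations elements="gen:page"?>
]>
<gen:page class="draft">
  <h1 align="center">Title</h1>
  <gen:include file="body.html"/>
</gen:page>
```

Under these assumptions, "h1" and its "align" attribute pass because "h1" is
an unknown element type, "class" passes because "gen:page" is enumerated in the
second PI, and "gen:include" is still validated against its declaration.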
==============================================================================
Normalized namespace prefixes
==============================================================================
The XML namespaces standard refers to names within namespaces as expanded
names. An expanded name is simply the pair (namespace_uri, localname); the
namespace prefix is not included in the expanded name.
PXP does not support expanded names, but it does support namespaces. However,
it uses a model that is slightly different from the usual representation of
names in namespaces: Instead of removing the namespace prefixes and converting
the names into expanded names, PXP prefers to normalize the namespace prefixes
used in a document, i.e. the prefixes are transformed such that they refer
uniquely to namespaces.
The following text is valid XML:
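A small text of this kind, reconstructed from the description that follows
(the nesting of the two elements and the character data "X" are assumptions):

```xml
<x:a xmlns:x="namespace1"><x:a xmlns:x="namespace2">X</x:a></x:a>
```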
The first element has the expanded name (namespace1,a) while the second element
has the expanded name (namespace2,a); so the elements have different types. As
already pointed out, PXP does not support expanded names directly.
Instead, the XML text is transformed while it is being parsed so that the
prefixes become unique. In this example, the transformed text would read:
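Assuming an input in which an outer and an inner element "a" both use the
prefix "x", bound to "namespace1" and "namespace2" respectively (a
reconstruction; the nesting and the content "X" are assumptions), the
normalized text would be along these lines:

```xml
<!-- Assumed input:
     <x:a xmlns:x="namespace1"><x:a xmlns:x="namespace2">X</x:a></x:a> -->
<x:a xmlns:x="namespace1"><x1:a xmlns:x1="namespace2">X</x1:a></x:a>
```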
From a programmer's point of view, this transformation has the advantage that
you need not deal with pairs when comparing names, as all names are still
simple strings: here, "x:a" and "x1:a". However, the transformation seems to
be a bit random. Why not "y:a" instead of "x1:a"? The answer is that PXP allows
the programmer to control the transformation: You can simply demand that
namespace1 must use the normalized prefix "x", and namespace2 must use "y". The
choice of normalized prefix can be made programmatically (by setting the
namespace_manager object), and it can be included in the DTD:
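A sketch of such DTD entries for the two namespaces of the example. The PI
spelling (target "pxp:dtd", keyword "namespace", attributes "prefix" and "uri")
is given as recalled from the PXP documentation and should be treated as an
assumption to verify against your PXP version:

```xml
<?pxp:dtd namespace prefix="x" uri="namespace1"?>
<?pxp:dtd namespace prefix="y" uri="namespace2"?>
```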
There is another advantage of using normalized prefixes: You can safely refer
to them in DTDs. For example, you could declare the two elements under their
normalized names, "x:a" and "y:a".
These declarations are applicable even if the XML text uses different prefixes,
because PXP normalizes any prefixes for namespace1 or namespace2 to the
preferred prefixes "x" and "y".
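Declarations of this kind might look as follows. The content model ANY is only
a placeholder assumption; the point is that the element names carry the
normalized prefixes "x" and "y":

```xml
<!ELEMENT x:a ANY>
<!ELEMENT y:a ANY>
```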
Since PXP-1.1.95, the namespace support has been extended. In addition to
prefix normalization, the parser now also stores the scoping structure of the
namespaces (in the namespace_scope objects). More or less, this means that the
parser remembers which elements have which "xmlns" attributes. There are two
important applications of this feature:
First, it is now possible to look up the namespace URI when only the original,
non-normalized namespace prefix is known. A number of XML standards, e.g.
XML Schema, use namespace prefixes within data nodes. Of course, these prefixes
are not normalized by PXP, but simply remain as they are when the XML text is
parsed. To get the URI of such a prefix p in the context of node n, just call
  n # namespace_scope # uri_of_display_prefix p
In PXP terminology, the non-normalized prefixes are now called "display
prefixes".
The other application is that it is now even possible to retrieve the original
"display" prefix of node names, e.g.
  n # display_prefix
returns it. However, the display prefix is only guessed, in the sense that when
several prefixes are bound to the same URI, any one of them may be taken. For
instance, when an element carries both xmlns:x="sample" and xmlns:y="sample",
both "x" and "y" are bound to the same URI "sample", and the display_prefix
method selects one of the two prefixes at random.
It is now even possible to output the parsed XML text with the original
namespace structure: The "display" method outputs XML text where the namespaces
are declared as in the original XML text.
Regarding the "xmlns" attributes, PXP treats them in a very special way. Not
only is it permitted to leave them undeclared in the DTD; any such declarations
would not even be applied to the actual "xmlns" attributes. For example, it is
not possible to declare a default value for "xmlns:x" in an ATTLIST
declaration: the default value would be ignored. Furthermore, it is not
possible to declare "xmlns" attributes as being required: validation will
always fail, even if the "xmlns" attribute is present.
The model behind this treatment is defined by the "XML information set"
standard. There are two kinds of attributes: normal attributes, and namespace
attributes. PXP validates only normal attributes.
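The two restrictions on "xmlns" declarations can be illustrated with a DTD
fragment. Both declarations are hypothetical (the element names and the URI
are assumptions); following the description above, PXP would ignore the
default in the first one, and the second one would make validation fail:

```xml
<!-- Default value for a namespace attribute: ignored by PXP -->
<!ATTLIST x:a xmlns:x CDATA "namespace1">

<!-- Required namespace attribute: validation fails even when
     xmlns:y is present on the element -->
<!ATTLIST y:a xmlns:y CDATA #REQUIRED>
```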