Overview
Schema Basics
Working with Namespaces
Complex Types
Empty Elements
Simple Content
Mixed Content
Allowing Any Content
Controlling Type Derivation
Although Document Type Definitions can enforce basic structural rules on documents, many applications need a more powerful and expressive validation method. The W3C developed the XML Schema Recommendation, released on May 2, 2001 after a long incubation period, to address these needs. Schemas can describe complex restrictions on elements and attributes. Multiple schemas can be combined to validate documents that use multiple XML vocabularies. This chapter provides a rapid introduction to key W3C XML Schema concepts and usage.
This chapter progressively introduces the structures and concepts of XML Schemas, beginning with the fundamental structure that is common to all schemas. It starts with a very simple schema and adds functionality to it until every major feature of XML Schemas has been introduced.
A schema is a formal description of what comprises a valid document. An XML schema is an XML document containing a formal description of what comprises a valid XML document. A W3C XML Schema Language schema is an XML schema written in the particular syntax recommended by the W3C.
TIP: In this chapter, when we use the word schema without further qualification, we are referring specifically to a schema written in the W3C XML Schema Language. However, there are numerous other XML schema languages, including RELAX NG and Schematron, each with its own strengths and weaknesses.
An XML document described by a schema is called an instance document. If a document satisfies all the constraints specified by the schema, it is considered to be schema-valid. The schema document is associated with an instance document through one of the following methods:
An xsi:schemaLocation attribute on an element contains a list of namespaces used within that element and the URLs of the schemas with which to validate elements in those namespaces.
An xsi:noNamespaceSchemaLocation attribute contains a URL for the schema used to validate elements that are not in any namespace.
The validating parser may attempt to locate the schema using the namespace of the element itself in one of these ways: directly by looking for a schema at that namespace, indirectly by looking for a RDDL document at that namespace, or implicitly by knowing in advance which schema is right for that namespace.
A validating parser may be instructed to validate a given document against an explicitly provided schema, ignoring any hints that might be provided within the document itself.
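As a rough illustration of the first two methods, the following Python sketch reads the schema hints from an instance document using only the standard library's xml.etree.ElementTree module. The instance document, its namespace URI, the schema URLs, and the schema_hints helper are all invented for this example. The sketch only extracts the hints; it does not fetch a schema or validate anything.

```python
import xml.etree.ElementTree as ET

# The official XML Schema instance namespace URI.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical instance document; the namespace URI and schema URLs are invented.
INSTANCE = """\
<addr:address xmlns:addr="http://example.com/ns/address"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.com/ns/address http://example.com/schemas/address.xsd"
    xsi:noNamespaceSchemaLocation="http://example.com/schemas/plain.xsd">
  <addr:fullName>Zoe</addr:fullName>
</addr:address>
"""

def schema_hints(root):
    """Return (namespace -> schema URL map, no-namespace schema URL or None)."""
    tokens = root.get("{%s}schemaLocation" % XSI, "").split()
    # The attribute value is a whitespace-separated list of namespace/URL pairs.
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    no_ns = root.get("{%s}noNamespaceSchemaLocation" % XSI)
    return pairs, no_ns

root = ET.fromstring(INSTANCE)
by_namespace, no_namespace = schema_hints(root)
print(by_namespace)
print(no_namespace)
```

A real schema processor treats these attributes only as hints, and the last method in the list lets an application override them entirely.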
DTDs provide the capability to do basic validation of the following items in XML documents:
Element nesting
Element occurrence constraints
Permitted attributes
Attribute types and default values
However, DTDs do not provide fine control over the format and data types of element and attribute values. Other than the various special attribute types (ID, IDREF, ENTITY, NMTOKEN, and so forth), once an element or attribute has been declared to contain character data, no limits may be placed on the length, type, or format of that content. For narrative documents (such as web pages, book chapters, and newsletters), this level of control is probably good enough.
But as XML makes inroads into more data-intensive applications (such as web services using SOAP), more precise control over the text content of elements and attributes becomes important. The W3C XML Schema standard includes the following features:
Simple and complex data types
Type derivation and inheritance
Element occurrence constraints
Namespace-aware element and attribute declarations
The most important of these features is the addition of simple data types for parsed character data and attribute values. Unlike DTDs, schemas can enforce specific rules about the contents of elements and attributes. In addition to a wide range of built-in simple types (such as string, integer, decimal, and dateTime), the schema language provides a framework for declaring new data types, deriving new types from old types, and reusing types from other schemas.
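To make this concrete, here is a small sketch of a schema that uses several built-in simple types and derives one new type by restriction. The element names, the PostalCode type, and its five-digit pattern are hypothetical. Because a schema is itself an XML document, the Python code around it simply parses the schema with the standard library and lists what it declares; it does not validate any instance document.

```python
import xml.etree.ElementTree as ET

# The namespace URI of the W3C XML Schema Language itself.
XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: element and type names are invented for illustration.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="fullName" type="xs:string"/>
  <xs:element name="quantity" type="xs:integer"/>
  <xs:element name="price" type="xs:decimal"/>
  <xs:element name="shipped" type="xs:dateTime"/>
  <xs:element name="postalCode" type="PostalCode"/>

  <!-- A new simple type derived from xs:string by restriction -->
  <xs:simpleType name="PostalCode">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

schema = ET.fromstring(SCHEMA)

# Map each top-level element declaration to the type it names.
declared = {e.get("name"): e.get("type")
            for e in schema.findall("{%s}element" % XS)}

# Map each named simple type to the base type it restricts.
derived = {}
for simple_type in schema.findall("{%s}simpleType" % XS):
    restriction = simple_type.find("{%s}restriction" % XS)
    derived[simple_type.get("name")] = restriction.get("base")

print(declared)
print(derived)
```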
Besides simple data types, schemas add the ability to place more explicit restrictions on the number and sequence of child elements that can appear in a given location. This is true even when elements are mixed with character data. By contrast, the mixed content model (#PCDATA) supported by DTDs cannot constrain the order or number of the child elements it allows.
WARNING: There are a few things that DTDs do that XML Schema can't do. Defining general entities for use in documents is one of these. XML Inclusions (XInclude) may be able to replace some uses of general entities, but DTDs remain extremely convenient for short entities.
As XML documents are exchanged between different people and organizations around the world, proper use of namespaces becomes critical to prevent misunderstandings. Depending on what type of document is being viewed, a simple element like <fullName>Zoe</fullName> could have widely different meanings. It could be a person's name, a pet's name, or the name of a ship that recently docked. By associating every element with a namespace URI, it is possible to distinguish between two elements with the same local name.
Because the Namespaces in XML recommendation was released after the XML 1.0 recommendation, DTDs do not provide explicit support for declaring namespace-aware XML applications. Unlike DTDs (where element and attribute declarations must include a namespace prefix), schemas validate against the combination of the namespace URI and local name rather than the prefixed name.
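The idea of comparing names by namespace URI and local name, rather than by prefix, can be sketched with a namespace-aware parser. The example below uses Python's standard xml.etree.ElementTree module, which reports each element name in the form {namespace-URI}local-name. The documents, namespace URIs, and the expanded_name helper are invented for illustration, and no schema validation is performed.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: different prefixes, same invented namespace URI.
DOC_A = '<p:fullName xmlns:p="http://example.com/ns/people">Zoe</p:fullName>'
DOC_B = '<person:fullName xmlns:person="http://example.com/ns/people">Zoe</person:fullName>'
# Same prefix and local name as DOC_A, but a different invented namespace URI.
DOC_C = '<p:fullName xmlns:p="http://example.com/ns/ships">Zoe</p:fullName>'

def expanded_name(xml_text):
    """Return the root element's name in ElementTree's {namespace-URI}local form."""
    return ET.fromstring(xml_text).tag

name_a = expanded_name(DOC_A)
name_b = expanded_name(DOC_B)
name_c = expanded_name(DOC_C)

print(name_a)
print(name_a == name_b)  # prefixes differ, URI and local name match
print(name_a == name_c)  # prefix matches, URI differs
```

A DTD, by comparison, would treat p:fullName and person:fullName as two unrelated element names, because it sees only the prefixed name.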
Namespaces are also used within instance documents to include directives to the schema processor. For example, the special attributes that are used to associate an element with a schema (schemaLocation and noNamespaceSchemaLocation) must be associated with the official XML Schema instance namespace URI (http://www.w3.org/2001/XMLSchema-instance) in order for the schema processor to recognize them as instructions to itself.
Copyright © 2002 O'Reilly & Associates. All rights reserved.