XML Documents and XML Files
Elements, Tags, and Character Data
Attributes
XML Names
Entity References
CDATA Sections
Comments
Processing Instructions
The XML Declaration
Checking Documents for Well-Formedness
This chapter shows you how to write simple XML documents. You'll see that an XML document is built from text content marked up with text tags such as <SKU>, <Record_ID>, and <author> that look superficially like HTML tags. However, in HTML you're limited to about a hundred predefined tags that describe web-page formatting. In XML you can create as many tags as you need. Furthermore, these tags will mostly describe the type of content they contain rather than formatting or layout information. In XML you don't say that something is italicized or indented or bold; you say that it's a book or a biography or a calendar.
Although XML is looser than HTML with regard to which tags it allows, it is much stricter about where those tags are placed and how they're written. In particular, all XML documents must be well-formed. Well-formedness rules specify constraints such as "Every start-tag must have a matching end-tag" and "Attribute values must be quoted." These rules are unbreakable, which makes parsing XML documents easy and writing them a little harder, but they still allow an almost unlimited flexibility of expression.
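The two rules quoted above can be tried out with any XML parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the chapter. It treats a document as well-formed if the parser accepts it without raising an error.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Illustrative samples (hypothetical content, not taken from the chapter).
samples = {
    "matching end-tag": "<person>Jane Doe</person>",
    "missing end-tag": "<person>Jane Doe",
    "quoted attribute": '<person id="p1">Jane Doe</person>',
    "unquoted attribute": "<person id=p1>Jane Doe</person>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

A parser is not allowed to guess at what a malformed document meant, which is why the two broken samples are rejected outright rather than repaired.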
An XML document contains text, never binary data. It can be opened with any program that knows how to read a text file. Example 2-1 is close to the simplest XML document imaginable. Nonetheless, it is a well-formed XML document. XML parsers can read it and understand it (at least as far as a computer program can be said to understand anything).
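To make "read it and understand it" concrete, here is a small sketch of a parser reading a one-element document. The document text below is a hypothetical stand-in and should not be taken as the book's Example 2-1; the person element name is borrowed from the filename mentioned in this chapter, and Python's xml.etree.ElementTree is just one parser among many.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document, not the book's Example 2-1.
document = "<person>\n  Jane Doe\n</person>"

root = ET.fromstring(document)

# The parser reports the element's name and its character data.
print(root.tag)
print(root.text.strip())
```

All the parser "understands" here is structure: there is one element, it has a name, and it contains some text.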
In the most common scenario, this document would be the entire contents of a file named person.xml, or perhaps 2-1.xml. However, XML is not picky about the filename. As far as the parser is concerned, this file could be called person.txt, person, or Hey you, there's some XML in this here file! Your operating system may or may not like these names, but an XML parser won't care. The document might not even be in a file at all. It could be a record or a field in a database. It could be generated on the fly by a CGI program in response to a browser query. It could even be stored in more than one file, though that's unlikely for such a simple document. If it is served by a web server, it will probably be assigned the MIME media type application/xml or text/xml. However, specific XML applications may use more specific MIME media types such as application/mathml+xml, application/xslt+xml, image/svg+xml, text/vnd.wap.wml, or even text/html (in very special cases).
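The point about filenames can be checked with a short experiment. The sketch below assumes Python's xml.etree.ElementTree as the parser; the helper root_tags_for and the document content are illustrative. It writes the same document under several names in a temporary directory, parses each copy, and then parses the text directly from memory with no file involved.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

XML_TEXT = "<person>Jane Doe</person>"  # illustrative content


def root_tags_for(names):
    """Write the same document under each name and parse it back."""
    tags = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            path = Path(tmp) / name
            path.write_text(XML_TEXT, encoding="utf-8")
            # The parser only looks at the bytes, not at the extension.
            tags[name] = ET.parse(str(path)).getroot().tag
    return tags


print(root_tags_for(["person.xml", "person.txt", "person"]))

# The document need not live in a file at all.
print(ET.fromstring(XML_TEXT).tag)
```

Every variant yields the same root element, since nothing in the parsing step depends on the name the operating system uses for the file.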
WARNING: For generic XML documents, application/xml should be preferred to text/xml, although most web servers use text/xml by default. text/xml uses the ASCII character set as a default, which is incorrect for most XML documents.
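The following sketch illustrates why an ASCII default is a poor fit. It does not model a web server or a browser; it only assumes a document containing one non-ASCII character (the name and the helper decodes_as are hypothetical) and shows that bytes which are valid UTF-8 cannot be read back as ASCII.

```python
def decodes_as(data: bytes, charset: str) -> bool:
    """Return True if data can be decoded using the given charset."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


# Illustrative document; \u00eb is "e with diaeresis", a non-ASCII character.
payload = "<person>Zo\u00eb</person>".encode("utf-8")

print(decodes_as(payload, "utf-8"))
print(decodes_as(payload, "ascii"))
```

A receiver that trusts an ASCII default would therefore reject or corrupt such a document, even though the document itself is perfectly good XML.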
Copyright © 2002 O'Reilly & Associates. All rights reserved.