Well-Formed Documents
The XML 1.0 specification defines the syntax for XML. If you understand the specification properly, you can construct a program that will be able to 'look' at a document that is supposed to be XML. If the document conforms to the specification for XML, then the program can do further processing on it. The idea underlying the XML specification is, therefore, that XML documents should be intelligible as such, either to humans or processing applications.
Being well-formed is the minimum set of requirements (defined in the specification) that a document needs to satisfy in order for it to be considered an XML document. Here, requirements are a mixture of ensuring that the correct language terms are employed and that the document is logically coherent in the manner defined by the specification (in other words that the terms of the language are used in the right way). You can see the XML specification at http://www.w3.org/tr/xml/ . There is also a helpful annotated version of the specification available at http://www.xml.com/axml/testaxml.htm.
So, what are these rules? You'll be pleased to hear that nearly everything we need to know about well-formed documents can be summed up in three rules:
- The document must contain one or more elements
- It must contain a uniquely named element, no part of which appears in the content of any other element. This is known as the root element
- All other elements must be kept within the root element and must be nested correctly
So, let's look at how we construct a well-formed document.
The XML Declaration
This is actually optional, although you are strongly advised to use it so that the receiving application knows that it is an XML document and also the version used (at the time of writing this was the only version).
<?xml version="1.0"?>
Note that the xml should be in lowercase. Note also that the XML declaration, when present, must not be preceded by any other characters (not even white space). As we saw previously, this declaration is also referred to as the XML prolog.
In this declaration, you can also specify the character encoding in which you have written your XML data. This is particularly important if your data contains characters that aren't part of the ASCII character set. You can specify the encoding using the optional encoding attribute:
<?xml version="1.0" encoding="iso-8859-1" ?>
The most common ones are shown in the following table:
| Language | Character set |
| Unicode (8 bit) | UTF-8 |
| Latin 1 (Western Europe, Latin America) | ISO-8859-1 |
| Latin 2 (Central/Eastern Europe) | ISO-8859-2 |
| Latin 3 (SE Europe) | ISO-8859-3 |
| Latin 4 (Scandinavia/Baltic) | ISO-8859-4 |
| Latin/Cyrillic | ISO-8859-5 |
| Latin/Arabic | ISO-8859-6 |
| Latin/Greek | ISO-8859-7 |
| Latin/Hebrew | ISO-8859-8 |
| Latin/Turkish | ISO-8859-9 |
| Latin/Lappish/Nordic/Eskimo | ISO-8859-10 |
| Japanese | EUC-JP or Shift_JIS |
If you want to read more about internationalization, check out the W3C's page on this topic at http://www.w3.org/International/ .
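As a rough illustration of what the declaration does for a receiving application, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. Python is not part of this chapter; the sample document and the variable names are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the encoding attribute tells the parser how to decode the bytes.
text = '<?xml version="1.0" encoding="iso-8859-1"?><name>café</name>'
data = text.encode("iso-8859-1")  # the é is stored as the single byte 0xE9

name = ET.fromstring(data)
decoded = name.text  # decoded by the parser using the declared encoding

# White space before the declaration makes the document unacceptable to the parser.
try:
    ET.fromstring(b' <?xml version="1.0"?><name/>')
    leading_space_ok = True
except ET.ParseError:
    leading_space_ok = False

print(decoded, leading_space_ok)
```

If the declared encoding did not match the bytes actually stored, the parser would either report an error or decode the wrong characters, which is why the attribute matters for non-ASCII data.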
Elements
As we have already seen, the XML document essentially consists of data marked up using tags. Each start-tag/end-tag pair, with the data that lies between them, is an element:
<mytag>Here we have some data</mytag>
The start and end tags must be exactly the same, except for the closing slash in the end-tag. Remember that they must be in the same case: <mytag> and <MyTag> would be considered as different tags.
The section between the tags that says, "Here we have some data", is called character data, while the tags either side are the markup. The character data can consist of any sequence of legal characters (conforming to the Unicode standard), except the start-tag character < and the ampersand &. The < is not allowed in case a processing application treats it as the start of a new tag, and the & is not allowed because it marks the start of a reference. If you do need to include these characters, you can represent them using numeric character references: &#60; for <, &#62; for > and &#38; for &.
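The following sketch shows one way to escape character data before placing it between tags. It uses Python's standard library rather than ASP, and the sample string is an assumption. Note that the escape function emits the predefined entity references (such as &lt;) rather than numeric character references; a parser treats the two forms the same way.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if a < b & b > c"          # hypothetical character data
escaped = escape(raw)             # replaces &, < and > with entity references

document = "<mytag>" + escaped + "</mytag>"
round_trip = ET.fromstring(document).text   # the parser restores the original text

# A numeric character reference is resolved in the same way.
numeric = ET.fromstring("<mytag>1 &#60; 2</mytag>").text

print(escaped)
print(round_trip)
print(numeric)
```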
Tag names can start with a letter, an underscore (_), or a colon character (:), followed by any combination of letters, digits, hyphens, underscores, colons, or periods. The only exception is that you cannot start a tag name with the letters XML in any combination of upper or lowercase letters, as such names are reserved. You are also advised not to start a tag name with a colon, in case it gets treated as part of a namespace (something we shall meet later on).
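The naming rules can be approximated with a regular expression. This is only a sketch: the pattern below covers ASCII letters, whereas the specification also permits many non-ASCII letters, and the function name is_acceptable_tag_name is made up for this example.

```python
import re

# ASCII-only approximation of the naming rule described above.
NAME_PATTERN = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

def is_acceptable_tag_name(name):
    """Return True if name follows the rules described in the text."""
    if name.lower().startswith("xml"):
        return False  # names beginning with xml (in any case) are reserved
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["mytag", "my-tag.v2", "_private", "2fast", "XmlData"]:
    print(candidate, is_acceptable_tag_name(candidate))
```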
Here is another example, marking up some details for a hardware store:
<inventory>
  <buckets>
    <bucket>
      <make>Addis</make>
      <capacity>3 litres</capacity>
    </bucket>
    <bucket>
      <make>Metro</make>
      <capacity>2.5 litres</capacity>
    </bucket>
  </buckets>
</inventory>
If you remember back to the three rules at the beginning of this section, you will be able to work out that this is a well-formed XML document. We have more than our one required element. We have a unique opening and closing tag: <inventory>, which is the root element. The elements are nested properly inside the root element.
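You can also let a parser confirm this for you. The sketch below is written in Python, with a helper name (is_well_formed) chosen for this example. It parses the inventory document and then tries two inputs that break the rules.

```python
import xml.etree.ElementTree as ET

INVENTORY = """<inventory>
  <buckets>
    <bucket>
      <make>Addis</make>
      <capacity>3 litres</capacity>
    </bucket>
    <bucket>
      <make>Metro</make>
      <capacity>2.5 litres</capacity>
    </bucket>
  </buckets>
</inventory>"""

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(INVENTORY)
makes = [bucket.findtext("make") for bucket in root.iter("bucket")]

print(root.tag, makes)
print(is_well_formed(INVENTORY))   # one root element, everything nested inside it
print(is_well_formed("<a/><b/>"))  # two top-level elements, so no single root
print(is_well_formed(""))          # no elements at all
```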
Let's have a look at some more examples to help us get an idea of how a well-formed XML document should be constructed.
At the simplest level we could have either:
<my_document></my_document>
or even
<my_document/>
To make sure that tags nest properly, there must be no overlap. So this is correct:
<parent>
  <child>Some character data</child>
</parent>
while this would be incorrect:
<bad_parent>
<naughty_child>
Some character data
</bad_parent>
</naughty_child>
This is because the closing </naughty_child> tag comes after the closing </bad_parent> tag, so the two elements overlap instead of nesting.
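A parser rejects overlapping elements outright. The sketch below is in Python, and the helper name parse_error is an assumption for this example. It returns the parser's message for each of the two documents; the exact wording of the message comes from the underlying parser and may vary.

```python
import xml.etree.ElementTree as ET

good = "<parent><child>Some character data</child></parent>"
bad = "<bad_parent><naughty_child>Some character data</bad_parent></naughty_child>"

def parse_error(xml_text):
    """Return the parser's error message, or None if xml_text is well-formed."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)

good_error = parse_error(good)
bad_error = parse_error(bad)

print(good_error)
print(bad_error)
```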
Attributes
Elements can have attributes. These are values that are passed to the application, but do not constitute part of the content of the element. Attributes are included as part of the element's opening tag, as in HTML. In XML all attribute values must be enclosed in quote marks (either single or double). For example:
<food healthy="yes">spinach</food>
Elements can have as many attributes as you want. So you could have:
<food healthy="no" tasty="yes" high_in_cholesterol="no">fries</food>
For well-formedness, however, you cannot repeat the attribute within an instance of the element. So you could not have:
<food tasty="yes" tasty="no">spinach</food>
Also, the string values between the quote marks cannot contain the characters < or & directly, nor the quote character (' or ") that is used to enclose the value.
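These attribute rules can be checked with a parser as well. The following Python sketch reads the attributes of the fries example, confirms that a repeated attribute and an unquoted value are both rejected, and uses the standard quoteattr helper to build a safely quoted value. The helper name is_well_formed and the "salt & vinegar" value are assumptions for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

food = ET.fromstring('<food healthy="no" tasty="yes" high_in_cholesterol="no">fries</food>')
attributes = dict(food.attrib)

repeated_ok = is_well_formed('<food tasty="yes" tasty="no">spinach</food>')
unquoted_ok = is_well_formed('<food healthy=yes>spinach</food>')

# quoteattr escapes &, < and > and wraps the value in suitable quote marks.
safe = quoteattr("salt & vinegar")
flavoured = ET.fromstring("<food flavour=" + safe + ">crisps</food>")

print(attributes)
print(repeated_ok, unquoted_ok)
print(safe, flavoured.get("flavour"))
```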
Other Features
There are also a number of other features of the XML specification that you need to learn if you progress to using XML frequently. Unfortunately there is not space to cover them all here. We will, however, briefly describe a few of them.
Entities
There are two categories of entity: general entities and parameter entities. Entities are usually used within a document as a way of avoiding having to type out long pieces of text several times within that document. They provide a way of associating a name with the long piece of text so that wherever you need to mention the text you just mention the name instead. As a result, if you have to modify the text, you only have to do it once (rather like the benefits offered by server-side includes).
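As a small illustration of a general entity, the sketch below declares one in the document's internal DTD subset and refers to it twice. The entity name company, its replacement text and the letter document are all invented for this example, and the parsing is done in Python rather than ASP.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "company" is declared once and used twice.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE letter [
  <!ENTITY company "Hypothetical Hardware Supplies Ltd">
]>
<letter>
  <from>&company;</from>
  <footer>Copyright &company;</footer>
</letter>"""

letter = ET.fromstring(DOCUMENT)
sender = letter.findtext("from")     # the parser substitutes the entity's text
footer = letter.findtext("footer")

print(sender)
print(footer)
```

Changing the text in the ENTITY declaration changes it everywhere the reference &company; appears.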
CDATA Sections
CDATA sections can be used wherever character data can appear within a document. They are used to escape (or delimit) blocks of text that would otherwise be considered as markup. So if we wanted to include the whole of the following line, including the tags:
<to_be_seen>Always wear light clothing when walking in the dark</to_be_seen>
we could use a CDATA section like so:
<element>
<![CDATA[ <to_be_seen>Always wear light clothing when walking in the dark</to_be_seen> ]]>
</element>
And the whole line, including the opening and closing <to_be_seen> tags, would not be processed or treated as tags by the receiving application.
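To see this from the receiving application's side, here is a short Python sketch (the variable names are assumptions). After parsing, the element has no child elements, and its text still contains the literal tags.

```python
import xml.etree.ElementTree as ET

DOCUMENT = (
    "<element>"
    "<![CDATA[<to_be_seen>Always wear light clothing when walking in the dark</to_be_seen>]]>"
    "</element>"
)

element = ET.fromstring(DOCUMENT)
children = list(element)   # empty: the CDATA content was not parsed as markup
text = element.text        # the literal characters, angle brackets included

print(children)
print(text)
```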
Comments
It is always good programming practice to comment your code. It is so much easier to read if it is commented in a manner that helps explain, reminds you about, or simply points out salient sections of code. It is surprising how code that seemed perfectly clear when you wrote it can soon become a jumble when you come back to it. While the descriptive XML tags often help you understand your own markup, there are times when the tags alone are not enough.
The good news is that comments in XML use exactly the same syntax as those in HTML:
<!--I really should add a comment here to remind me about xxxxx -->
In order to avoid confusing the receiving application, you must not include the string -- (two hyphens together) in your comment text, and the comment text should not end with a hyphen.
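The sketch below uses Python's standard xml.dom.minidom module, which keeps comments in the parsed tree, to read a comment back and to show that a double hyphen inside a comment is rejected. The note documents are invented for this example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

dom = minidom.parseString(
    "<note><!-- remember to update this --><body>text</body></note>"
)
comments = [
    node.data
    for node in dom.documentElement.childNodes
    if node.nodeType == node.COMMENT_NODE
]

# Two hyphens together inside the comment text are not allowed.
try:
    minidom.parseString("<note><!-- first -- second --></note>")
    double_hyphen_ok = True
except ExpatError:
    double_hyphen_ok = False

print(comments, double_hyphen_ok)
```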
Processing Instructions
These allow documents to contain instructions for applications using the XML data. They take the form:
<?NameOfTargetApplication Instructions for Application?>
The target name cannot be xml in any combination of upper or lower case, and names beginning with those letters are reserved. Otherwise, you can create your own to work with the processing application (unless there are any predefined by the application at which you are targeting your XML).
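A receiving application sees a processing instruction as a target plus a string of instructions. The Python sketch below reads both with xml.dom.minidom; the target name myapp and its instruction text are made up for this example.

```python
from xml.dom import minidom

# Hypothetical processing instruction aimed at an application called "myapp".
dom = minidom.parseString('<?myapp refresh="daily"?><inventory/>')

pi = dom.firstChild                  # the instruction precedes the root element
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
target, data = pi.target, pi.data    # name of the application, and its instructions

print(is_pi, target, data)
```

The parser does not interpret the instruction text; it is up to the target application to decide what refresh="daily" means.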
In our next Try It Out, we will be looking at badly formed XML. We can tell a lot about whether our XML is well-formed by simply loading it into Internet Explorer 5. It has the ability to tell us about all sorts of errors (though it does let some slip). When you are first writing XML, it is very helpful to do this quick check so that you know your XML is well-formed.
Try It Out – Badly formed XML
1. Open up your books.xml file.
2. Remove the opening <book> tag
3. Save the file as bad_book.xml
4. Load it into Internet Explorer 5
Here is the result:
[Screenshot: Internet Explorer 5 error message]
As you can see, the error message is pretty accurate. It more or less explicitly tells you that it was expecting an opening <book> tag. It certainly wouldn't take you long to find out what was wrong.
5. Put the opening book tag in again and change the line:
<title>Beginning ASP 3.0</title>
to
<title>Beginning ASP 3.0<title>
removing the closing slash.
6. Save the file again, and open it up in IE5 (or simply click the Refresh button, if you have it open already). You should get a result like this:
[Screenshot: Internet Explorer 5 error message]
Again, we are not given the exact error, but IE was expecting a closing </title> tag, which it did not receive.
7. Finally, correct the closing </title> tag, and remove the opening quote from the US price attribute. Save the file and refresh your browser. This time you get the exact error:
[Screenshot: Internet Explorer 5 error message]
While it is not the most elegant way to test code, it certainly does help find errors quickly. If you have made more than one error, just correct your mistakes one at a time and watch the error messages change.
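If you do not have Internet Explorer 5 to hand, the same kind of quick check can be sketched with any XML parser. The Python example below reports the line, column and message of the first error it finds. The contents of books.xml are not reproduced in this section, so the BAD_BOOK string is only a hypothetical stand-in, and the function name describe_error is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bad_book.xml: the second <title> should be </title>.
BAD_BOOK = """<books>
  <book>
    <title>Beginning ASP 3.0<title>
  </book>
</books>"""

def describe_error(xml_text):
    """Return (line, column, message) for the first error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)

result = describe_error(BAD_BOOK)
print(result)
```

As with the browser, the parser stops at the first problem it meets, so fix one error at a time and run the check again.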