XML introduction

Introduction to XML and associated languages - Part I: XML.

XML is a software- and hardware-independent tool for storing and transporting data.

XML stands for eXtensible Markup Language.
XML is a markup language much like HTML.
XML was designed to store and transport data.
XML was designed to be self-descriptive.
XML is a W3C Recommendation.

The paragraph above is how XML is described on the W3schools website.

This tutorial is an introduction to XML itself, but also to other languages used when working with XML documents, in particular XPath, XQuery, and XSLT. Unless otherwise stated, the program samples have been developed and tested on Windows 10. Although XML is completely platform-independent, some programs may not work as expected if you use another operating system. The reader is expected to know how to create a webpage using HTML and JavaScript; some basic knowledge of PHP, Perl, Pascal, and/or other programming languages will make it easier to understand the program samples used throughout the tutorial. No knowledge of XML is required.

The important thing to know about XML is that it is not like other languages (HTML in particular). Whereas those other languages "do something" (HTML, for example, displays some information/data in a way that is defined by the HTML code), XML "does nothing at all". Its sole purpose is to describe some data. To read, write, store, or display the content of an XML file, a specific piece of software has to be written.

Let's take an example. The file Alanine.xml contains a description of the amino acid of the same name. This description includes the amino acid's 1-letter and 3-letter codes, its name, as well as its simple and extended molecular formulas.
<?xml version="1.0" encoding="UTF-8"?>
<amino-acid>
<code1>A</code1>
<code3>Ala</code3>
<name>Alanine</name>
<formula1>C3H7NO2</formula1>
<formula2>CH3-CH(NH2)-COOH</formula2>
</amino-acid>

Describing our amino acid this self-descriptive way, the XML file is easily readable and fully understandable by humans. And, as it's just plain text, it can easily be passed from one application to another, or from one computer to another, independently of the operating system that these computers run.
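As a minimal sketch of such a "specific piece of software", the following Python snippet reads the description above with the standard library module xml.etree.ElementTree. Python is used here only for illustration. The XML is embedded as a string to keep the example self-contained, and the variable names are illustrative choices, not part of the tutorial's own samples.

```python
import xml.etree.ElementTree as ET

# Content of Alanine.xml, embedded as a string (assumption: in practice
# you would read the file itself, e.g. with ET.parse("Alanine.xml")).
ALANINE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<amino-acid>
<code1>A</code1>
<code3>Ala</code3>
<name>Alanine</name>
<formula1>C3H7NO2</formula1>
<formula2>CH3-CH(NH2)-COOH</formula2>
</amino-acid>"""

root = ET.fromstring(ALANINE_XML)

# Map each child tag to its text content.
amino_acid = {child.tag: child.text for child in root}

for tag, text in amino_acid.items():
    print(f"{tag}: {text}")
```

Because the tags describe the data, the program can refer to the items by name (for example amino_acid["formula1"]) instead of by position in the file.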

Also note that XML is extensible. If, for example, we change our file by adding the Latin name of the amino acid, this just means adding a further tag, without changing what was there before. In other words, programs written for the old version of the file will still work with the new version.
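The extensibility argument can be sketched in a few lines of Python (standard library only). The <latin-name> tag and its placeholder text are assumptions made for this illustration; the function describe is a hypothetical stand-in for a program written for the old file version.

```python
import xml.etree.ElementTree as ET

OLD_VERSION = """<amino-acid>
<code1>A</code1>
<code3>Ala</code3>
<name>Alanine</name>
<formula1>C3H7NO2</formula1>
<formula2>CH3-CH(NH2)-COOH</formula2>
</amino-acid>"""

# Same data plus a hypothetical <latin-name> element; the tag name and the
# placeholder text are assumptions, not taken from the tutorial's files.
NEW_VERSION = OLD_VERSION.replace(
    "</name>", "</name>\n<latin-name>(Latin name goes here)</latin-name>"
)


def describe(xml_text):
    """Return 'name (formula)' using only the tags the old program knows."""
    root = ET.fromstring(xml_text)
    return f"{root.findtext('name')} ({root.findtext('formula1')})"


old_result = describe(OLD_VERSION)
new_result = describe(NEW_VERSION)
print(old_result)
print(new_result)
```

This works because the program looks up elements by name; a program that relied on the position of the child elements would be less tolerant of additions.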

Using XML to store data can largely simplify things. On the W3schools website, they put it like this: "Many computer systems contain data in incompatible formats. Exchanging data between incompatible systems (or upgraded systems) is a time-consuming task for web developers. Large amounts of data must be converted, and incompatible data is often lost. XML stores data in plain text format. This provides a software- and hardware-independent way of storing, transporting, and sharing data. XML also makes it easier to expand or upgrade to new operating systems, new applications, or new browsers, without losing data. With XML, data can be available to all kinds of "reading machines" like people, computers, voice machines, news feeds, etc."

At the beginning of the tutorial, I said that XML is a markup language, just like HTML. So what are the big differences between these two languages?

XML was designed to carry data - with focus on what data is.
HTML was designed to display data - with focus on how data looks.
XML tags are not predefined like HTML tags are.

This latter point is an important feature of XML. We are totally free to choose the tag names, so we can choose a tag that "leaves no doubt about the meaning of the associated data". By the way, using the tags <simpleFormula> and <extendedFormula> (instead of <formula1> and <formula2>) would have been a somewhat better choice in our amino acid example. Important: With XML, each opening tag always requires a corresponding closing tag!

What do you think happens if we open an XML file in a web browser? XML says nothing about how the displayed data should look, so all the browser can do is display the file content. That's not exactly true: modern web browsers display XML files as a document tree, a kind of hierarchical display of the data that reflects the relative position of the different tags within the tree. The screenshot shows the file Alanine.xml in Firefox.

Display of an XML document in Firefox web browser

XML Copy Editor.

As XML documents are plain text, a simple text editor like Windows Notepad can be used to work with such files. A better choice would be the "best text editor ever", Notepad++, which has syntax highlighting for XML (as for lots of other languages). There are, however, also specialized XML editors that not only include advanced editor features (like pretty-print or syntax validation), but may also include extras like DTD and XML Schema support, searching XML documents using XPath expressions, or converting XML documents using XSLT.

There are several free XML editors available on the Internet. I just tried one of them, and I think that it includes everything that we could wish for when working with XML documents. The software is called XML Copy Editor, and you can download it from SourceForge. You can install it only for yourself or for all users. You can also associate a whole bunch of file extensions with the application. By default, all available choices are selected; I kept it like that.

If an XML document doesn't contain line breaks, i.e. the whole XML is on one single line, its display in Notepad++ isn't of much use. In XML Copy Editor, just choose XML > Pretty-print from the menu bar, and that's it. The document is displayed as it should be, i.e. as a document tree, with the possibility to fold/unfold the different <amino-acid> elements. The file shown in the screenshot contains information about all 20 amino acids used to build proteins.

XML document opened in the freeware application 'XML Copy Editor'
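Pretty-printing can also be done programmatically. The following is a minimal Python sketch using ET.indent from the standard library (available since Python 3.9); the one-line document is a shortened, made-up excerpt with a single amino acid, not the actual file shown in the screenshot.

```python
import xml.etree.ElementTree as ET

# A document written on one single line (shortened to one amino acid).
ONE_LINE = (
    "<amino-acids><amino-acid><code1>A</code1>"
    "<name>Alanine</name></amino-acid></amino-acids>"
)

root = ET.fromstring(ONE_LINE)

# Insert line breaks and two spaces of indentation per nesting level.
ET.indent(root, space="  ")

pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Note that indenting adds whitespace text to the document, which is harmless here but can matter for documents where whitespace is significant.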

Structure of an XML document.

An XML document starts with a prolog defining the XML version and the character encoding; e.g.
<?xml version="1.0" encoding="UTF-8"?>

Then follows the document data, organized as a document tree. The XML tree starts with one single root element (in the example in the editor: <amino-acids>), and branches from the root to one or more child elements (in the example in the editor: the different <amino-acid> elements). All elements can have sub-elements, i.e. further child elements (in the example in the editor: the <amino-acid> elements have 5 children: <code1>, <code3>, etc).

The terms parent, child, and sibling are used to describe the relationships between elements. The root element is the parent of its direct children and the ancestor of all other elements. The element <name> is a child of the element <amino-acid>, and a sibling of the elements <code1> and <code3>.
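These relationships can be explored with a short Python sketch (standard library only). The document below is a shortened, illustrative version of the amino acids file; parent_of is a helper table built here because ElementTree elements do not keep a reference to their parent.

```python
import xml.etree.ElementTree as ET

DOC = """<amino-acids>
  <amino-acid>
    <code1>A</code1>
    <code3>Ala</code3>
    <name>Alanine</name>
  </amino-acid>
</amino-acids>"""

root = ET.fromstring(DOC)

# Build a child -> parent lookup table by walking the whole tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

name = root.find("amino-acid/name")
parent = parent_of[name]

# Siblings are the other children of the same parent.
siblings = [el.tag for el in parent if el is not name]

print(parent.tag)
print(siblings)
```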

Like HTML elements, XML elements can have attributes. Attributes are designed to contain data related to a specific element. Attribute values must always be enclosed in quotes. Either single or double quotes can be used.

As an example, let's rewrite Alanine.xml using attributes for the 1-letter and 3-letter codes:
<?xml version="1.0" encoding="UTF-8"?>
<amino-acid code1="A" code3="Ala">
<name>Alanine</name>
<formula1>C3H7NO2</formula1>
<formula2>CH3-CH(NH2)-COOH</formula2>
</amino-acid>

This is essentially the same as before. Thus, the question arises: When should we use elements, and when attributes? A page on the W3schools website discusses the problems related to attributes. I think that the best practice is to use elements for everything that is part of the data itself (that's all items in our example), and to use attributes for metadata, i.e. items that are not actually part of the data itself, but contain some supplementary information. A typical example is an ID, which has nothing to do with the data itself, but is only used to identify a given element.
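From a program's point of view, the difference shows up in how the values are accessed. A minimal Python sketch (standard library only), using the attribute version of the Alanine description; the "id" attribute queried at the end is a hypothetical example of metadata that this file does not contain.

```python
import xml.etree.ElementTree as ET

ALANINE_XML = """<amino-acid code1="A" code3="Ala">
<name>Alanine</name>
<formula1>C3H7NO2</formula1>
<formula2>CH3-CH(NH2)-COOH</formula2>
</amino-acid>"""

root = ET.fromstring(ALANINE_XML)

# Attributes are read from the element itself ...
code1 = root.get("code1")
code3 = root.get("code3")

# ... whereas child elements are looked up by tag name.
name = root.findtext("name")

# A missing attribute yields the given fallback value instead of an error.
missing = root.get("id", "(none)")

print(code1, code3, name, missing)
```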

An XML document can contain comments. They start with <!-- and end with -->. Two dashes in the middle of a comment are not allowed.

I already mentioned that each opening tag (start tag) must have a corresponding closing tag (end tag). It's quite obvious that the different tags have to be properly nested.

Some further important XML syntax rules:

XML tags are case sensitive.
XML does not truncate multiple white-spaces (remember that HTML truncates multiple white-spaces to one single white-space).
XML stores a new line as LF (Windows applications store it as CR+LF).
Some characters have a special meaning in XML, and if used inside an XML element would generate an error. To avoid this problem, replace the concerned character with an entity reference (cf. below).

The most obvious character that would cause problems is the < sign, which the parser would interpret as the beginning of an opening tag, i.e. the start of a new element. The entity reference for < is &lt; (this is exactly the same as in HTML). So, for example, instead of writing
<message>Balance < 1000€!</message>
(that would result in an error), we have to write:
<message>Balance &lt; 1000€!</message>

There are 5 pre-defined entity references in XML:

&lt;	<	less than
&gt;	>	greater than
&amp;	&	ampersand
&apos;	'	apostrophe
&quot;	"	quotation mark

Strictly speaking, only < and & are illegal in XML, but it is a good habit to replace the others, too, in particular >.
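When XML is generated by a program, the replacement is usually left to a library function. A minimal Python sketch using escape from the standard library module xml.sax.saxutils: by default it replaces &, < and >; the apostrophe and the quotation mark are only replaced if passed explicitly, as done for all_five below.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "Balance < 1000€!"

# Replace the special characters before placing the text in an element.
escaped = escape(text)
document = f"<message>{escaped}</message>"

# The parser turns the entity reference back into the original character.
parsed_text = ET.fromstring(document).text

# Escaping all five characters that have a pre-defined entity reference.
all_five = escape("< > & ' \"", {"'": "&apos;", '"': "&quot;"})

print(document)
print(parsed_text)
print(all_five)
```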

Note: An XML document that conforms to the syntax rules above is said to be a well-formed XML document.
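Any XML parser can serve as a well-formedness checker, since it must reject documents that break these rules. A minimal Python sketch (standard library only); is_well_formed is a hypothetical helper name, and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


checks = {
    "properly nested": is_well_formed("<a><b>text</b></a>"),
    "improperly nested": is_well_formed("<a><b>text</a></b>"),
    "case mismatch": is_well_formed("<Name>Alanine</name>"),
    "unescaped < sign": is_well_formed("<message>Balance < 1000</message>"),
    "missing end tag": is_well_formed("<a><b>text</b>"),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

Only the first sample is accepted; the others each violate one of the rules listed above.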

Namespaces.

In XML, element names are defined by the developer. This often results in a conflict when trying to mix XML documents from different XML applications, and even more so if the XML documents have been created by people working for different companies. Consider the following example (from the W3schools website):

XML element <table> containing information about some HTML table:
<table>
<tr>
<td>Apples</td>
<td>Bananas</td>
</tr>
</table>
XML element <table> containing information about a table (a piece of furniture):
<table>
<name>African Coffee Table</name>
<width>80</width>
<length>120</length>
</table>

Obviously these two tables have nothing in common, and if for some reason we have to use both XML elements, we must find some possibility to solve the name conflict.

The simplest way to do this is to use a prefix. In XML, prefixes must be associated with a namespace. The namespace can be defined by an xmlns:prefix="namespace-URI" attribute in the start tag of an element. When a namespace is defined for an element, all child elements with the same prefix are associated with the same namespace.

Example (continuing from the W3schools website):
<root>
<h:table xmlns:h="http://www.w3.org/TR/html4/">
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table xmlns:f="https://www.w3schools.com/furniture">
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
</root>

We may also declare the namespaces in the root element of the XML document. Example (still continuing from the W3schools website):
<root xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="https://www.w3schools.com/furniture">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table>
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
</root>
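The following Python sketch (standard library only) shows how a program can tell the two kinds of tables apart. The prefix-to-URI mapping NS is defined in the program; what matters for the lookup is the namespace URI, not the prefix spelling used in the document.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="https://www.w3schools.com/furniture">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table>
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
</root>"""

# Prefixes used in the search paths below, mapped to the namespace URIs.
NS = {
    "h": "http://www.w3.org/TR/html4/",
    "f": "https://www.w3schools.com/furniture",
}

root = ET.fromstring(DOC)

fruits = [td.text for td in root.findall("h:table/h:tr/h:td", NS)]
furniture_name = root.findtext("f:table/f:name", namespaces=NS)

# Internally, the parser stores each tag together with its namespace URI.
first_tag = root[0].tag

print(fruits)
print(furniture_name)
print(first_tag)
```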

Default namespace.

Defining a default namespace for an element saves us from using prefixes in all the child elements. Syntax:
xmlns="namespace-URI"

A Uniform Resource Identifier (URI) is a string of characters which (uniquely) identifies an Internet resource. The most common URI is the Uniform Resource Locator (URL) which identifies an Internet domain address; another, not really commonly used, type of URI is the Uniform Resource Name (URN).

Specifying a default namespace is not mandatory, but it is good practice to do so. The purpose of using a URI is to give the namespace a unique name. It is not used by the parser to look up information. Thus, the URL specified does not necessarily have to point to an existing Internet resource. However, companies often use the namespace as a pointer to a web page containing namespace information.
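A minimal Python sketch (standard library only) of what a default namespace means for a program reading the document. The URI http://example.com/amino-acids is a hypothetical name chosen for this illustration; the parser never tries to fetch it.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; it only serves as a unique name.
NS_URI = "http://example.com/amino-acids"

DOC = f"""<amino-acid xmlns="{NS_URI}">
<name>Alanine</name>
</amino-acid>"""

root = ET.fromstring(DOC)

# The unprefixed child element inherits the default namespace, so a lookup
# by the bare tag name finds nothing ...
plain_lookup = root.find("name")

# ... whereas a lookup by the namespace-qualified name succeeds.
qualified_lookup = root.find(f"{{{NS_URI}}}name")

print(root.tag)
print(plain_lookup)
print(qualified_lookup.text)
```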

Here is a new version of the file Alanine.xml from the beginning of the tutorial. The root element contains a namespace declaration (including a default prefix), using the URI "http://www.microfocus.com/xcentrisity/xml-extensions/symbol-table/". Where does this URI come from? In fact, the XML file has been generated by a Visual COBOL program (cf. my tutorial An introduction to Visual COBOL XML Extensions), and the Visual COBOL xml export file statement adds a URL on the Microfocus website.
<?xml version="1.0" encoding="UTF-8"?>
<amino-acid xmlns:xtk="http://www.microfocus.com/xcentrisity/xml-extensions/symbol-table/">
<code1>A</code1>
<code3>Ala</code3>
<name>Alanine</name>
<formula1>C3H7NO2</formula1>
<formula2>CH3-CH(NH2)-COOH</formula2>
</amino-acid>

DTD and XML Schema.

An XML document with correct syntax is called a well-formed XML document. A well-formed XML document that also has a correct structure is called a valid XML document. Whereas the correct syntax is defined by the XML language itself, the correct structure depends on the document; in other words, it is the developer creating a given XML document who decides how this document must be structured in order to be valid.

In the case of our file Alanine.xml (and similar files for other amino acids), this means that the root element must be called "<amino-acid>", and that this element must have 5 child elements called "<code1>", "<code3>", "<name>", "<formula1>", and "<formula2>" respectively. Also, all of the child elements must contain strings.
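To make these structural rules concrete, here is a hand-written check in Python (standard library only). It is a sketch of the rules just listed, not a real validator; has_expected_structure is a hypothetical helper, and it additionally assumes that the five child elements must appear in the listed order.

```python
import xml.etree.ElementTree as ET

EXPECTED_CHILDREN = ["code1", "code3", "name", "formula1", "formula2"]


def has_expected_structure(xml_text):
    """Check root name, child names (in order), and text-only children."""
    root = ET.fromstring(xml_text)
    if root.tag != "amino-acid":
        return False
    if [child.tag for child in root] != EXPECTED_CHILDREN:
        return False
    # Each child must hold plain text only, i.e. have no sub-elements.
    return all(len(child) == 0 for child in root)


VALID = """<amino-acid>
<code1>A</code1>
<code3>Ala</code3>
<name>Alanine</name>
<formula1>C3H7NO2</formula1>
<formula2>CH3-CH(NH2)-COOH</formula2>
</amino-acid>"""

# Well-formed, but with the two formula elements in the wrong order.
SWAPPED = """<amino-acid>
<code1>A</code1>
<code3>Ala</code3>
<name>Alanine</name>
<formula2>CH3-CH(NH2)-COOH</formula2>
<formula1>C3H7NO2</formula1>
</amino-acid>"""

valid_ok = has_expected_structure(VALID)
swapped_ok = has_expected_structure(SWAPPED)
print(valid_ok, swapped_ok)
```

Writing such checks by hand for every document type quickly becomes tedious, which is why a declarative description of the structure is preferable.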

There are 2 specifications that can be used to define the structure and the legal elements and attributes of an XML document:

Document Type Definition (DTD). I think that DTD is no longer commonly used (?). Its syntax is rather complicated and entirely different from the XML syntax.
XML Schema. XML Schemas are much more powerful than DTDs, with the following advantages:
- XML Schemas are written in XML
- XML Schemas are extensible to additions
- XML Schemas support data types
- XML Schemas support namespaces

Here is the code of a DTD, that can be used to validate our amino acid files:
<!DOCTYPE amino-acid
[
<!ELEMENT amino-acid (code1,code3,name,formula1,formula2)>
<!ELEMENT code1 (#PCDATA)>
<!ELEMENT code3 (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT formula1 (#PCDATA)>
<!ELEMENT formula2 (#PCDATA)>
]>

!DOCTYPE defines the root element. !ELEMENT describes the different elements. !ELEMENT amino-acid defines that the element "amino-acid" must contain the child elements "code1", "code3", "name", "formula1", "formula2" (in this sequence). !ELEMENT code1 (and the remaining ones) defines this element to be of type #PCDATA (parsed character data). For further information concerning DTD, have a look at the chapter XML DTD at the W3schools website.

And here is the corresponding code of an XML Schema:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="amino-acid">
<xs:complexType>
<xs:sequence>
<xs:element name="code1" type="xs:string"/>
<xs:element name="code3" type="xs:string"/>
<xs:element name="name" type="xs:string"/>
<xs:element name="formula1" type="xs:string"/>
<xs:element name="formula2" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>

<xs:element name="amino-acid"> defines the element called "amino-acid". <xs:complexType> defines that the "amino-acid" element is a complex type. <xs:sequence> defines that the complex type is a sequence of elements. <xs:element name="code1" type="xs:string"> (and the remaining ones) defines that this element is of type string (text). For further details, have a look at the chapter XML Schema at the W3schools website.

To have our amino acid XML files validated against the XML schema, we have to add the schema location. If the document uses a namespace, the syntax is:
<amino-acid xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="namespace-URI schema-URI">
Or, if the document has no namespace (as is the case for our files), using a file URI:
<amino-acid xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="file:schema-path">

Notes:

If you specify a namespace-URI in the XML document, you'll have to add the attribute targetNamespace="namespace-URI" to the xs:schema element in the schema file.
The file URI of a Windows path is specified in a "Linux-like" way, with forward slashes (we use the same way, for example, in the MSYS2 terminals). Example:
xsi:noNamespaceSchemaLocation="file:/C:/Users/allu/Programming/XML/Amino-acid.xsd".
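If the schema path comes from a program, the conversion can be sketched as follows in Python (standard library only). windows_path_to_file_uri is a hypothetical helper, and the second path is a made-up example. It produces the three-slash form file:///C:/..., a common alternative spelling of the one-slash form used above; whether a given tool accepts both forms is an assumption to verify.

```python
from urllib.parse import quote


def windows_path_to_file_uri(path):
    """Turn an absolute Windows path into a file URI."""
    # Use forward slashes, as required in URIs.
    forward = path.replace("\\", "/")
    # Percent-encode characters such as spaces; keep "/" and ":" as they are.
    return "file:///" + quote(forward, safe="/:")


uri = windows_path_to_file_uri(r"C:\Users\allu\Programming\XML\Amino-acid.xsd")
spaced = windows_path_to_file_uri(r"C:\My Files\Amino-acid.xsd")

print(uri)
print(spaced)
```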

XML validation using XML Copy Editor.

The screenshot below shows the original version of Alanine.xml, opened in XML Copy Editor, after I had pushed the Check well-formedness button (blue "OK" icon).

XML Copy Editor: Checking the well-formedness of an XML document

If we try to check the validity of the document at this stage, pushing the Validate button (green "OK" icon), we'll get the error message No declaration found for element 'amino-acid'. This means that there is no description of the document structure and the legal elements and attributes available, and thus the document can't be validated.

Instead of adding the schema declaration manually to the XML document, we can let XML Copy Editor do that. From the menu, choose XML > Associate > XML Schema....

XML Copy Editor: Associating an XML schema with an XML document

A dialog box opens, and we can enter the path to the schema file, or use the Browse button to navigate to the schema file. XML Copy Editor will then automatically insert the xmlns:xsi and xsi:noNamespaceSchemaLocation attributes into the amino-acid element. The screenshot shows the modified XML document and the successful validation against the schema amino-acid.xsd (which contains the code shown further up in the text).

XML Copy Editor: Successful validation of an XML document against an XML schema

And to conclude this part of the tutorial, let's modify the file Alanine.xml, exchanging the positions of the formula1 and formula2 elements. Validating the modified document will, of course, result in an error message (note that XML Copy Editor stops validation after the detection of the first error).

XML Copy Editor: Invalid XML document (validated against an XML schema)

Note: When opening an invalid XML document with a web browser, the document tree is displayed based on the document's content. There is no validation check done, and no error message is displayed if the document isn't valid...
