XML

Overview

It is highly recommended to read the HTML section before reading this section. HTML and XML have a few similar concepts and terms that will make this section easier to understand.

XML stands for Extensible Markup Language. XML and HTML appear and sound very similar but have different goals. Whereas HTML was designed to display data, XML was designed to store data.

Like HTML, XML still has start tags, content, and end tags, however, rather than using a finite set of "legal" or "official" tags, XML tags are defined and created by the user. What does this mean? It means the following is perfectly legal XML.

<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>

Rather than opening that content in a web browser and expecting it to display data, we would instead write custom software to send, receive, store, display, or do something with the data.

Specifications

The W3C provides the official specification for the Extensible Markup Language (XML). They also have a succinct description. The W3C also provides the official specifications for

XPath expressions

XPath Expressions are expressions created to programmatically select nodes or node-sets from an XML document.

The following table (from w3schools) shows some of the most useful expressions that can be used to select nodes programmatically.

Unresolved include directive in modules/tools-and-standards/pages/data-formats/xml.adoc - include::example$expressions.csv[]

Predicates

A predicate is a way to filter a node-set by evaluating the expression contained within a set of square brackets []. For example, let’s say you wanted to get all of the questions that have points > 1. Doing this using predicates is easy.

<exam>
  <question>
    <text>What is red?</text>
    <solution>A color.</solution>
    <points>2</points>
  </question>
  <question>
    <text>What is a square?</text>
    <solution>A shape.</solution>
    <points>1</points>
  </question>
</exam>

The XPath expression to fetch the questions that are worth more than 1 point is the following.

//question[points > 1]

Operators

In order to make comparisons, we need operators. The following is a list of XPath operators from here.

Unresolved include directive in modules/tools-and-standards/pages/data-formats/xml.adoc - include::example$operators.csv[]

Functions

XPath functions are an easy way to filter data with more control. You can find a list of functions here. Of particular use are the contains and translate functions. contains allows you to see if the first argument string contains the second argument string. It is particularly useful to see if a class or attribute contains some substring. translate allows you to, among other things, change the case (whether the text is capitalized or not) of some text prior to using the contains function.

Examples

While the following examples are useful, to see examples using XPath expressions in R or Python, please check out the following links.

The following are examples using the following XML document. You can use this tool to test your XPath expressions.

<html>
    <head>
        <title>My Title</title>
    </head>
    <body>
        <div>
            <div class="abc123 sktoe-sldjkt dkjfg3-dlgsk">
                <div class="glkjr-slkd dkgj-0 dklfgj-00">
                    <a class="slkdg43lk dlks" href="https://example.com/123456">
                    </a>
                </div>
            </div>
            <div>
                <div class="ldskfg4">
                    <span class="slktjoe" aria-label="123 comments, 43 Retweets, 4000 likes">Love it.</span>
                </div>
            </div>
            <div data-amount="12">13</div>
        </div>
        <div>
            <div class="abc123 sktoe-sls dkjfg-dlgsk">
                <div class="glkj-slkd dkgj-0 dklfj-00">
                    <a class="slkd3lk dls" href="https://example.com/123456">
                    </a>
                </div>
            </div>
            <div>
                <div class="ldg4">
                    <span class="sktjoe" aria-label="1000 comments, 455 Retweets, 40000 likes">Love it.</span>
                </div>
            </div>
            <div data-amount="122">133</div>
        </div>
    </body>
</html>

Write an XPath expression to get the "title" element.

Solution
//title

Write an XPath expression to get the content of the "title" element.

Solution
//title/text

Write an XPath expression to get every "div" element in the document.

Solution
//div

Write an XPath expression to get every "div" element in the document with the "class" attribute having a value of "ldskfg4".

Solution
//div[@class="ldskfg4"]

Write an XPath expression to get every "div" element where the string "abc123" is in the "class" attribute’s value (as a substring).

Solution
//div[contains(@class, 'abc123')]

Write an XPath expression to get every "div" element with an "aria-label" attribute.

Solution
//div[@aria-label]

GPX

Resources

A w3schools tutorial on XPath expressions. A nice 1 page summary.

A simple, but very thorough cheatsheet to recall XPath expressions.

A great, online tool for testing XPath expressions.

XML Books

There are many books that cover XML and related topics.

  • Learning XML, 2nd Edition, by Erik T. Ray (O’Reilly, 2003), available at O’Reilly or Amazon

  • Learning XSLT by Michael Fitzgerald (O’Reilly, 2004), available at O’Reilly or Amazon

  • XML Hacks by Michael Fitzgerald (O’Reilly, 2004), available at O’Reilly or Amazon

  • XML in a Nutshell, 3rd Edition, by Elliotte Rusty Harold and W. Scott Means (O’Reilly, 2004), available at O’Reilly or Amazon

  • XPath and XPointer by John Simpson (O’Reilly, 2002), available at O’Reilly or Amazon

  • XQuery, 2nd Edition, by Priscilla Walmsley (O’Reilly, 2016), available at O’Reilly or Amazon

  • XSLT, 2nd Edition, by Doug Tidwell (O’Reilly, 2008), available at O’Reilly or Amazon

  • XSLT Cookbook, 2nd Edition, by Sal Mangano (O’Reilly, 2006), available at O’Reilly or Amazon

  • XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang (Springer, 2014), available at Springer Link or Amazon