TDM 20200: Project 1 — 2023

Motivation: Extensible Markup Language or XML is a very important file format for storing structured data. Even though formats like JSON, and csv tend to be more prevalent, many, many legacy systems still use XML, and it remains an appropriate format for storing complex data. In fact, JSON and csv are quickly becoming less relevant as new formats and serialization methods like parquet and protobufs are becoming more common.

Context: In this project we will use the lxml package in Python. This is the first project in a series of 5 projects focused on web scraping in Python.

Scope: python, XML

Learning objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Match XML terms to sections of XML demonstrating working knowledge.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/otc/hawaii.xml

Questions

It would be well worth your time to:

  1. Read the HTML section of the book.

  2. Read through the XML section of the book.

  3. Finally, take the time to work through pandas 10 minute intro.

Question 1

Check out this old project that uses a different dataset — you may find it useful for this project.

One of the challenges of XML is that it can be hard to get a feel for how the data is structured — especially in a large XML file. A good first step is to find the name of the root node. Use the lxml package to find and print the name of the root node.

Interesting! If you took a look at the previous project, you probably weren’t expecting the extra {urn:hl7-org:v3} part in the root node name. This is because the previous project’s dataset didn’t have a namespace! Namespaces in XML are a way to prevent issues where a document may have multiple sets of node names that are identical but have different meanings. The namespaces allow them to exist in the same space without conflict.

Practically what does this mean? It makes XML parsing ever-so-slightly more annoying to perform. Instead of being able to enter XPath expressions and return elements, we have to define a namespace as well. This will be made more clear later.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

XML can be nested — there can be elements that contain other elements that contain other elements. In the previous question, we identified the root node AND the namespace. Just like in the previous Spring 2021 project 1 (linked in the "tip" in question 1), we would like you to find the names of the next "tier" of elements.

This will not be a copy/paste of the previous solution. Why? Because of the namespace!

First, try to use the same method from question (2) from this project to find the next tier of names. What happens?

hawaii.xpath("/document") # won't work
hawaii.xpath("{urn:hl7-org:v3}document") # still won't work with the namespace there

How do we fix this? We must define our namespace, and reference it in our XPath expression. For example, the following will work.

hawaii.xpath("/ns:document", namespaces={'ns': 'urn:hl7-org:v3'})

Here, we are passing a dict to the namespaces argument. The key is whatever we want to call the namespace, and the value is the namespace itself. For example, the following would work too.

hawaii.xpath("/happy:document", namespaces={'happy': 'urn:hl7-org:v3'})

So, unfortunately, every time we want to use an XPath expression, we have to prepend namespace: before the name of the element we are looking for. This is a pain, and unfortunately we cannot just define it once and move on.

Okay, given this new information, please find the next "tier" of elements.

There should be 8.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Okay, lucky for you, this XML file is not so big! Use your UNIX skills you refined last semester to print the content of the XML file. You can print the entirety in a bash cell if you wish.

You will be able to see that it contains information about a drug of some sort.

Knowing now that there are ingredient elements in the XML file. Write Python code, and an XPath expression to get a list of all of the ingredient elements. Print the list of elements.

When we say "print the list of elements", we mean to print the list of elements. For example, the first element would be:

<ingredient classCode="IACT">
    <ingredientSubstance>
        <code code="O7TSZ97GEP" codeSystem="2.16.840.1.113883.4.9"/>
        <name>DIBASIC CALCIUM PHOSPHATE DIHYDRATE</name>
    </ingredientSubstance>
</ingredient>

To print an Element object, see the following.

print(etree.tostring(my_element, pretty_print=True).decode('utf-8'))
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

At this point in time you may be wondering how to actually access the bits and pieces of data in the XML file.

There is data between tags.

<name>DIBASIC CALCIUM PHOSPHATE DIHYDRATE</name>

To access such data from the "name" Element (which we will call my_element below) you would do the following.

my_element.text # DIABASIC CALCIUM PHOSPHATE DIHYDRATE

There is also data tucked away in a tag’s attributes.

<code code="O7TSZ97GEP" codeSystem="2.16.840.1.113883.4.9"/>

To access such data from the "name" Element (which we will call my_element below) you would do the following.

my_element.attrib['code'] # O7TSZ97GEP
my_element.attrib['codeSystem'] # 2.16.840.1.113883.4.9

The aspect of XML that we are interested in learning about are XPath expressions. XPath expressions are a clear and effective way to extract elements from an XML document (or HTML document — think extracting data from a webpage!).

In the previous question you used an XPath expression to find all of the ingredient elements, regardless where they were or how they were nested in the document. Let’s practice more.

If you look at the XML document, you will see that there are a lot of code attributes. Use lxml and XPath expressions to first extract all elements with a code attribute. Print all of the values of the code attributes.

Repeat the process, but modify your XPath expression (not your Python code, just the XPath expression) so that it only keeps elements that have a code attribute that starts with a capital "C". Print all of the values of the code attributes.

You can use the .attrib attribute to access the attributes of an Element. It is a dict-like object, so you can access the attributes similarly to how you would access the values in a dictionary.

This link may help you when figuring out how to select the elements where the code attribute must start with "C".

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

The quantity element contains a numerator and a denominator element. Print all of the quantities in the XML file, where a quantity is defined as the value of the value attribute of the numerator element divided by the value of the value attribute of the corresponding denominator element. Lastly, print the unit (part of the numerator element afterwards.

The results should read as follows:

1.0 1
5.0 g
7.6 mg
5.0 g
4.0 g
230.0 mg
4.0 g

You may need to use the float function to convert the string values to floats.

You can use the xpath method on an Element object. When doing so, if you want to limit the scope of your XPath expression, make sure to start the xpath with ".//ns:" this will start the search from within the element instead of searching the entire document.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.