TDM 20200: Project 1 — 2024

Motivation: Extensible Markup Language (XML) is an important file format for storing structured data. Even though formats like JSON and CSV tend to be more prevalent, many legacy systems still use XML, and it remains an appropriate format for storing complex data. JSON and CSV face competition of their own, as newer formats and serialization methods such as Parquet and Protocol Buffers (protobuf) become more common.

Context: In this project we will use the lxml package in Python. This is a first project focusing on web scraping.

Scope: Python, XML

Learning objectives

Review and summarize the differences between XML and HTML/CSV.
Match XML terms to sections of XML, demonstrating working knowledge.

Readings and Resources

This link will show you more information about lxml.

Check out this old project that uses a different dataset — you may find it useful for this project.

Please check out this Example Book lxml introduction.

We also added some specific examples about using lxml to parse the XML files in the otc data set.

lxml examples with the otc data set

These examples can be especially helpful for you as you work on Project 1.

We also added 5 more videos for you, which are specifically about Project 1.

Dataset(s)

The questions below use the following dataset(s):

/anvil/projects/tdm/data/otc/valu.xml

Questions

Question 1 (2 points)

Please use the lxml package to find and print the name of the root element’s tag in the XML file valu.xml.
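As a starting point, here is a minimal sketch of reading a root tag with lxml. It uses a small made-up document (the `catalog` element and the `urn:example:catalog` namespace are assumptions for illustration, not the contents of valu.xml), parsed from memory instead of from a file. For a real file you would pass its path to `etree.parse` instead.

```python
import io
from lxml import etree

# Hypothetical toy document with a default namespace (not valu.xml).
XML = b'<catalog xmlns="urn:example:catalog"><item/></catalog>'

# etree.parse accepts a file path or a file-like object.
tree = etree.parse(io.BytesIO(XML))
root = tree.getroot()

# For a namespaced element, .tag has the form "{namespace-uri}localname".
print(root.tag)

# etree.QName splits the tag into its namespace and local name.
qname = etree.QName(root)
print(qname.localname)
print(qname.namespace)
```

Note that when the document declares a namespace, the tag includes the namespace URI in curly braces, so you may want to think about which form the question is asking for.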

Question 2 (2 points)

Please use 'xpath' with a namespace to find the 'title' element.

There are 13 title elements altogether. Please show the contents of the very first title element, since this is the title of the page.

You will need to use a namespace with xpath; otherwise, you won’t get the element. This link will show you more information about using XPath with lxml.
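To see why the namespace matters, here is a minimal sketch on a made-up document. The `library` and `book` elements, the `urn:example:books` URI, and the `b` prefix are all assumptions for illustration; the namespace URI in valu.xml will be different, and you can choose any prefix you like in the mapping.

```python
from lxml import etree

# Hypothetical toy document with a default namespace (not valu.xml).
XML = b"""<library xmlns="urn:example:books">
  <title>Catalog</title>
  <book><title>First Book</title></book>
</library>"""

root = etree.fromstring(XML)

# Without a prefix, XPath looks for 'title' in no namespace and matches nothing.
no_ns = root.xpath("//title")

# Map a prefix of our choosing to the namespace URI, then use it in the path.
ns = {"b": "urn:example:books"}
titles = root.xpath("//b:title", namespaces=ns)

print(len(no_ns))
print(len(titles))
print(titles[0].text)
```

The results come back in document order, so the first item in the list is the first matching element in the file.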

Question 3 (2 points)

Please use 'xpath' with a namespace to find and list all child elements directly under the 'document' element in the XML file.

Question 4 (2 points)

Please get and list all author elements, including their child elements and attributes.

To print an Element object, you may refer to the following sample code. You will need to modify the example code to fit your code.

print(etree.tostring(my_element, pretty_print=True).decode('utf-8'))

Question 5 (2 points)

Please list all codeSystem attribute values from the file.
Please list the codeSystem value for which the 'displayName' attribute contains the string 'DOSAGE'.

You can use the .attrib attribute to access the attributes of an Element. It is a dict-like object, so you can access the attributes similarly to how you would access the values in a dictionary.
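Here is a minimal sketch of that dict-like access on a made-up document. The `items` and `code` elements and the attribute values are assumptions for illustration only, not data from valu.xml.

```python
from lxml import etree

# Hypothetical toy document (not valu.xml).
XML = b"""<items>
  <code codeSystem="1.2.3" displayName="DOSAGE FORM"/>
  <code codeSystem="4.5.6" displayName="Route"/>
  <code displayName="no system here"/>
</items>"""
root = etree.fromstring(XML)

# .attrib supports the 'in' test and [] lookup, like a dictionary.
systems = [
    el.attrib["codeSystem"]
    for el in root.iter("code")
    if "codeSystem" in el.attrib
]

# .get() returns a default instead of raising KeyError when an attribute is missing.
dosage = [
    el.attrib["codeSystem"]
    for el in root.iter("code")
    if "codeSystem" in el.attrib and "DOSAGE" in el.get("displayName", "")
]

print(systems)
print(dosage)
```

Checking for the attribute before indexing avoids a KeyError on elements that do not carry it.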

This link may help you when figuring out how to select the right elements.

Project 01 Assignment Checklist

Jupyter Lab notebook with your code, comments and output for the assignment
- firstname-lastname-project01.ipynb
Python file with code and comments for the assignment
- firstname-lastname-project01.py
Submit files through Gradescope

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.