STAT 39000: Project 1 — Spring 2021

Motivation: Extensible Markup Language or XML is a very important file format for storing structured data. Even though formats like JSON, and csv tend to be more prevalent, many, many legacy systems still use XML, and it remains an appropriate format for storing complex data. In fact, JSON and csv are quickly becoming less relevant as new formats and serialization methods like parquet and protobufs are becoming more common.

Context: In previous semesters we’ve explored XML. In this project we will refresh our skills and, rather than exploring XML in R, we will use the lxml package in Python. This is the first project in a series of 5 projects focused on web scraping in R and Python.

Scope: python, XML

Learning objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Match XML terms to sections of XML demonstrating working knowledge.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


The following questions will use the dataset found in Scholar:



We realize that it may be a while since you’ve used Python. That’s okay! We are going to be taking things at a much more reasonable pace than Spring 2020.

Some potentially useful resources for the semester include:

  • The STAT 19000 projects. We are easing 19000 students into Python and will post solutions each week. It would be well worth 10 minutes to look over the questions and solutions each week.

  • Here is a decent cheat sheet that helps you quickly get an idea of how to do something you know how to do in R, in Python.

  • The Examples Book — updating daily with more examples and videos. Be sure to click on the "relevant topics" links as we try to point you to topics with examples that should be particularly useful to solve the problems we assign.


It would be well worth your time to read through the XML section of the book, as well as take the time to work through pandas 10 minute intro.

Question 1

A good first step when working with XML is to get an idea how your document is structured. Normally, there should be good documentation that spells this out for you, but it is good to know what to do when you don’t have the documentation. Start by finding the "root" node. What is the name of the root node of the provided dataset?

Make sure to import the lxml package first:

from lxml import etree

Here are two videos about running Python in RStudio:

And here is a video about XML scraping in Python:

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 2

Remember, XML can be nested. In question (1) we figured out what the root node was called. What are the names of the next "tier" of elements?

Now that we know the root node, you could use the root node name as a part of your xpath expression.

As you may have noticed in question (1) the xpath method returns a list. Sometimes this list can contain many repeated tag names. Since our goal is to see the names of the second "tier" elements, you could convert the resulting list to a set to quickly see the unique list as `set’s only contain unique values.

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 3

Continue to explore each "tier" of data until there isn’t any left. Name the "full paths" of all of the "last tier" tags.

Let’s say a "last tier" tag is just a path where there are no more nested elements. For example, /HealthData/Workout/WorkoutRoute/FileReference is a "last tier" tag. If you try and get the nested elements for it, they don’t exist:


Here are 3 of the 7 "full paths":

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 4

At this point in time you may be asking yourself "but where is the data"? Depending on the structure of the XML file, the data could either be between tags like:


Or, it could be in an attribute:

<question answer="tac">What is cat spelled backwards?</question>

Collect the "ActivitySummary" data, and convert the list of dicts to a pandas DataFrame. The following is an example of converting a list of dicts to a pandas DataFrame called myDF:

import pandas as pd
list_of_dicts = []
list_of_dicts.append({'columnA': 1, 'columnB': 2})
list_of_dicts.append({'columnB': 4, 'columnA': 1})
myDF = pd.DataFrame(list_of_dicts)

It is important to note that an element’s "attrib" attribute looks and feels like a dict, but it is actually a lxml.etree._Attrib. If you try to convert a list of lxml.etree._Attrib to a pandas DataFrame, it will not work out as you planned. Make sure to first convert each lxml.etree._Attrib to a dict before converting to a DataFrame. You can do so like:

# this will convert a single `lxml.etree._Attrib` to a dict
my_dict = dict(my_lxml_etree_attrib)
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 5

pandas is a Python package that provides the DataFrame and Series classes. A DataFrame is very similar to a data.frame in R and can be used to manipulate the data within very easily. A Series is the class that handles a single column of a DataFrame. Go through the pandas in 10 minutes page from the official documentation. Sort, find, and print the top 5 rows of data based on the "activeEnergyBurned" column.

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.