TDM 20200: Project 4 — 2024

Motivation: It is worthwhile to learn how to parse hundreds of thousands of files systematically. We practice this skill, step by step.

Context: We return to the over-the-counter medications from Project 1, aiming to extract the ingredient substances from each medication, and creating a tally of all of the ingredient substances.

Scope: Python, XML

Learning Objectives
  • Extract XML content from a large number of files and summarize some of the content in the files.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/otc/archive1

through

  • /anvil/projects/tdm/data/otc/archive10

  • It is worthwhile to review the skills learned in Project 1, in which we analyzed the files:

/anvil/projects/tdm/data/otc/valu.xml

and

/anvil/projects/tdm/data/otc/hawaii.xml

  • We already downloaded and uncompressed the 10 zip files from the website dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm for the Human OTC Labels data. We only stored the .xml files and removed the .jpg files (we do not need the .jpg files for this project).

  • The uncompressed files are stored in the directories:

  • /anvil/projects/tdm/data/otc/archive1

through

  • /anvil/projects/tdm/data/otc/archive10

When building the dictionary, you will see that Dr Ward writes:

if mystring not in mydict:
    mydict[mystring] = 1
else:
    mydict[mystring] += 1

Some of you will not have worked with dictionaries too much in the past. Dictionaries start out empty, and you need to add the words as you go.

Alternatively, if you want to, you can just write:

mydict[mystring] = mydict.get(mystring, 0) + 1

This approach is a little bit cleaner, but I didn’t know if you would understand it. If mystring is missing from the dictionary, then mydict.get(mystring, 0) will return 0, so that mydict[mystring] is created with a value of 1. Otherwise, if mystring is already in the dictionary, then mydict.get(mystring, 0) will return the previous number of occurrences, so that mydict[mystring] gets bumped up by 1, to account for the new occurrence of mystring.

Either approach is OK, and you might have another Pythonic way that you want to handle this step in creating the dictionary.

Questions

Question 1 (2 points)

Run the lines:

import pandas as pd
import lxml.etree
import glob
  1. Remind yourself how to extract the ingredient substances from each of these two files: /anvil/projects/tdm/data/otc/valu.xml and /anvil/projects/tdm/data/otc/hawaii.xml

For each of these two files, print a list of all ingredient substances (it is OK if some are repeated; also, do not worry about which ingredient that the ingredient substances come from). For instance, if you extract the ingredient substances from the file

/anvil/projects/tdm/data/otc/valu.xml

you should get these ingredient substances:

HYPROMELLOSES
MINERAL OIL
POLYETHYLENE GLYCOL, UNSPECIFIED
POLYSORBATE 80
POVIDONE, UNSPECIFIED
... blah blah blah ...
STARCH, CORN
SODIUM STARCH GLYCOLATE TYPE A CORN
STEARIC ACID
TITANIUM DIOXIDE
ACETAMINOPHEN

or if you extract the ingredient substances from the file

/anvil/projects/tdm/data/otc/hawaii.xml

you should get these ingredient substances:

DIBASIC CALCIUM PHOSPHATE DIHYDRATE
WATER
SORBITOL
SODIUM LAURYL SULFATE
CARBOXYMETHYLCELLULOSE SODIUM, UNSPECIFIED FORM
... blah blah blah ...
WHITE WAX
MANGIFERA INDICA SEED BUTTER
ROSEMARY OIL
TOCOPHEROL
ZINC OXIDE

Question 2 (2 points)

  1. Use this Python code: for myfile in glob.glob("/anvil/projects/tdm/data/otc/archive1/*.xml")[0:11] and use also this code: tree = lxml.etree.parse(myfile) to loop over the first eleven files in the archive1 directory. Print all of the ingredient substances from these first eleven files.

  2. Make a Python dictionary (called a dict in Python) from the ingredient substances, keeping track of the number of times that each ingredient substance occurs.

Question 3 (2 points)

  1. Convert the dictionary from question 2b to a data frame.

  2. Sort the dataframe according to the counts, and print the 5 most popular ingredient substances from those 10 files, and the number of times that each of these 5 most popular ingredient substances occurs. Your output should contain:

COCAMIDOPROPYL BETAINE    60
FD&C BLUE NO. 1           70
CITRIC ACID MONOHYDRATE   87
GLYCERIN                  93
WATER                    114

Question 4 (2 points)

  1. Now analyze the first 1000 files from the archive1 directory, and print the output that shows the 5 most popular ingredient substances from those 1000 files, and the number of times that each of these 5 most popular ingredient substances occurs.

  2. Now try to analyze all of the files from the archive1 directory. Likely, your work will fail, because there is at least one enormous file that needs a little bit fancier parsing method! So you can add these lines:

from lxml.etree import XMLParser, parse
p = XMLParser(huge_tree=True)

and then add the parameter parser=p to your parse statement. Now you can analyze all of the files from the archive1 directory. Print output that shows the 5 most popular ingredient substances from all of the files (altogether) in the archive1 directory, and the number of times that each of these 5 most popular ingredient substances occurs.

Question 5 (2 points)

  1. Now analyze all of the files in all 10 directories archive1 through archive10, and print output that shows the 5 most popular ingredient substances from all of the files (altogether) in these 10 directories, and the number of times that each of these 5 most popular ingredient substances occurs.

Project 04 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project04.ipynb

  • Python file with code and comments for the assignment

    • firstname-lastname-project04.py

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.