STAT 39000: Project 14 — Fall 2020

Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.

Context: We’ve just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.

Scope: awk, UNIX utilities, bash scripts, fread

Learning objectives
  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.

  • Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.

  • Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.

  • Use grep to search files effectively.

  • Use cut to section off data from the command line.

  • Use piping to string UNIX commands together.

  • Use awk for data extraction, and preprocessing.

  • Create bash scripts to automate a process or processes.


The following questions will use ENTIRE_PLOTSNAP.csv from the data folder found in Scholar:


To read more about ENTIRE_PLOTSNAP.csv that you will be working with:


Question 1

Take a look at at ENTIRE_PLOTSNAP.csv. Write a line of awk code that displays the STATECD followed by the number of rows with that STATECD.

Items to submit
  • Code used to solve the problem.

  • Count of the following `STATECD`s: 1, 2, 4, 5, 6

Question 2

Unfortunately, there isn’t a very accessible list available that shows which state each STATECD represents. This is no problem for us though, the dataset has LAT and LON! Write some bash that prints just the STATECD, LAT, and LON.

There are 92 columns in our dataset: awk -F, 'NR==1{print NF}' ENTIRE_PLOTSNAP.csv. To create a list of STATECD to state, we only really need STATECD, LAT, and LON. Keeping the other 89 variables will keep our data at 2.6gb.

Items to submit
  • Code used to solve the problem.

  • The output of your code piped to head.

Question 3

fread is a "Fast and Friendly File Finagler". It is part of the very popular data.table package in R. We will learn more about this package next semester. For now, read the documentation here and use the cmd argument in conjunction with your bash code from (2) to read the data of STATECD, LAT, and LON into a data.table in your R environment.

Items to submit
  • Code used to solve the problem.

  • The head of the resulting data.table.

Question 4

We are going to further understand the data from question (3) by finding the actual locations based on the LAT and LON columns. We can use the library revgeo to get a location given a pair of longitude and latitude values. revgeo uses a free API hosted by photon in order to do so.

For example:

revgeo(longitude=-86.926153, latitude=40.427055, output='frame')

The code above will give you the address information in six columns, from the most-granular housenumber to the least-granular country. Depending on the coordinates, revgeo may or may not give you results for each column. For this question, we are going to keep only the state column.

There are over 4 million rows in our dataset — we do not want to hit photon’s API that many times. Instead, we are going to do the following:

  • Unless you feel comfortable using data.table, convert your data.table to a data.frame:

my_dataframe <- data.frame(my_datatable)
  • Calculate the average LAT and LON for each STATECD, and call the new data.frame, dat. This should result in 57 rows of lat/long pairs.

  • For each row in dat, run a reverse geocode and append the state to a new column called STATE.

To calculate the average LAT and LON for each STATECD, you could use the sqldf package to run SQL queries on your data.frame.

mapply is a useful apply function to use to solve this problem.

Here is some extra help:

points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))
# Note that the "output" argument gets passed to the "revgeo" function.
mapply(revgeo, points$longitude, points$latitude, output="frame")
# The output isn't in a great format, and we'd prefer to just get the "state" data.
# Let's wrap "revgeo" into another function that just gets "state" and try again.
get_state <- function(lon, lat) {
  return(revgeo(lon, lat, output="frame")["state"])
mapply(get_state, points$longitude, points$latitude)

It is okay to get "Not Found" for some of the addresses.

Items to submit
  • Code used to solve the problem.

  • The head of the resulting data.frame.

Question 5

Use the leaflet, addTiles, and addCircles functions from the leaflet package to map our average latitude and longitude data from question (4) to a map (should be a total of 57 lat/long pairs).

See here for an example of adding points to a map.

Items to submit
  • Code used to create the map.

  • The map itself as output from running the code chunk.

Question 6

Write a bash script that accepts at least 1 argument, and performs a useful task using at least 1 dataset from the forest folder in /anvil/projects/tdm/data/forest/. An example of a useful task could be printing a report of summary statistics for the data. Feel free to get creative. Note that tasks must be non-trivial — a bash script that counts the number of lines in a file is not appropriate. Make sure to properly document (via comments) what your bash script does. Also ensure that your script returns columnar data with appropriate separating characters (for example a csv).

Items to submit
  • The content of your bash script starting from #!/bin/bash.

  • Example output from running your script as intended.

  • A description of what your script does.

Question 7

You used fread in question (2). Now use the cmd argument in conjunction with your script from (6) to read the script output into a data.table in your R environment.

Items to submit
  • The R code used to read in and preprocess your data using your bash script from (6).

  • The head of the resulting data.table.