STAT 29000: Project 14 — Fall 2020

Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.

Context: We’ve just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.

Scope: awk, UNIX utilities, bash scripts, fread

Learning objectives
  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.

  • Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.

  • Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.

  • Use grep to search files effectively.

  • Use cut to section off data from the command line.

  • Use piping to string UNIX commands together.

  • Use awk for data extraction, and preprocessing.

  • Create bash scripts to automate a process or processes.


The following questions will use ENTIRE_PLOTSNAP.csv from the data folder found in Scholar:


To read more about ENTIRE_PLOTSNAP.csv that you will be working with:


Question 1

Take a look at at ENTIRE_PLOTSNAP.csv. Write a line of awk code that displays the STATECD followed by the number of rows with that STATECD.

Items to submit
  • Code used to solve the problem.

  • Count of the following `STATECD`s: 1, 2, 4, 5, 6

Question 2

Unfortunately, there isn’t a very accessible list available that shows which state each STATECD represents. This is no problem for us though, the dataset has LAT and LON! Write some bash that prints just the STATECD, LAT, and LON.

There are 92 columns in our dataset: awk -F, 'NR==1{print NF}' ENTIRE_PLOTSNAP.csv. To create a list of STATECD to state, we only really need STATECD, LAT, and LON. Keeping the other 89 variables will keep our data at 2.6gb.

Items to submit
  • Code used to solve the problem.

  • The output of your code piped to head.

Question 3

fread is a "Fast and Friendly File Finagler". It is part of the very popular data.table package in R. We will learn more about this package next semester. For now, read the documentation here and use the cmd argument in conjunction with your bash code from (2) to read the data of STATECD, LAT, and LON into a data.table in your R environment.

Items to submit
  • Code used to solve the problem.

  • The head of the resulting data.table.

Question 4

We are going to further understand the data from question (3) by finding the actual locations based on the LAT and LON columns. We can use the library revgeo to get a location given a pair of longitude and latitude values. revgeo uses a free API hosted by photon in order to do so.

For example:

revgeo(longitude=-86.926153, latitude=40.427055, output='frame')

The code above will give you the address information in six columns, from the most-granular housenumber to the least-granular country. Depending on the coordinates, revgeo may or may not give you results for each column. For this question, we are going to keep only the state column.

There are over 4 million rows in our dataset — we do not want to hit photon’s API that many times. Instead, we are going to do the following:

  • Unless you feel comfortable using data.table, convert your data.table to a data.frame:

my_dataframe <- data.frame(my_datatable)
  • Calculate the average LAT and LON for each STATECD, and call the new data.frame, dat. This should result in 57 rows of lat/long pairs.

  • For each row in dat, run a reverse geocode and append the state to a new column called STATE.

To calculate the average LAT and LON for each STATECD, you could use the sqldf package to run SQL queries on your data.frame.

mapply is a useful apply function to use to solve this problem.

Here is some extra help:

points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))
# Note that the "output" argument gets passed to the "revgeo" function.
mapply(revgeo, points$longitude, points$latitude, output="frame")
# The output isn't in a great format, and we'd prefer to just get the "state" data.
# Let's wrap "revgeo" into another function that just gets "state" and try again.
get_state <- function(lon, lat) {
  return(revgeo(lon, lat, output="frame")["state"])
mapply(get_state, points$longitude, points$latitude)

It is okay to get "Not Found" for some of the addresses.

Items to submit
  • Code used to solve the problem.

  • The head of the resulting data.frame.