# STAT 19000: Project 15 — Fall 2020

Motivation: Some people say it takes 20 hours to learn a skill, some say 10,000 hours. What is certain is it definitely takes time. In this project we will explore an interesting dataset and exercise some of the skills learned this semester.

Context: This is the final project of the semester. We sincerely hope that you’ve learned something, and that we’ve provided you with first hand experience digging through data.

Scope: r

Learning objectives
• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Utilize apply functions in order to solve a data-driven problem.

• Gain proficiency using split, merge, and subset.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/donorschoose/`

## Questions

### Question 1

Read the data `/class/datamine/data/donorschoose/Projects.csv` into a data.frame called `projects`. Make sure you use the function you learned in Project 13 (`fread`) from the `data.table` package to read the data. Don’t forget to then convert the `data.table` into a `data.frame`. Let’s do an initial exploration of this data. What types of projects (`Project.Type`) are there? How many resource categories (`Project.Resource.Category`) are there?

Items to submit
• R code used to solve the question.

• 1-2 sentences containing the project’s types and how many resource categories are in the dataset.

### Question 2

Create two new variables in `projects`, the number of days a project lasted and the number of days until the project was fully funded. Name those variables `project_duration` and `time_until_funded`, respectively. To calculate them use the project’s posted date (`Project.Posted.Date`), expiration date (`Project.Expiration.Date`), and fully funded date (`Project.Fully.Funded.Date`). What are the shortest and longest times until a project is fully funded? For consistency check, see if we have any negative project’s duration. If so, how many?

 You may find the argument `units` in the function `difftime` useful.
 Be sure to pay attention to the order of operations of `difftime`.
 Note that if you used the `fread` function from `data.table` to read in the data, you will not need to convert the columns as date.
 It is not required that you use `difftime`.
Items to submit
• R code used to solve the question.

• Shortest and longest times until a project is fully funded.

• 1-2 sentences answering whether we have if we have negative project’s duration, and if so how many.

### Question 3

As you noted in question (2) there may be some project’s with negative duration time. As we may have some concerns for the data regarding these projects, filter the `projects` data to exclude the projects with negative duration, and call this filtered data `selected_projects`. With that filtered data, make a `dotchart` for mean time until the project is fully funded (`time_until_funded`) for the various resource categories (`Project.Resource.Category`). Make sure to comment on your results. Are they surprising? Could there be another variable influencing this result? If so, name at least one.

 You will first need to average time until fully funded for the different categories before making your plot.
 To make your `dotchart` look nicer, you may want to first order the average time until fully funded before passing it to the `dotchart` function. In addition, consider reducing the y-axis font size using the argument `cex`.
Items to submit
• R code used to solve the question.

• Resulting dotchart.

• 1-2 sentences commenting on your plot. Make sure to mention whether you are surprised or not by the results. Don’t forget to add if you think there could be more factors influencing your answer, and if so, be sure to give examples.

### Question 4

Read `/class/datamine/data/donorschoose/Schools.csv` into a data.frame called `schools`. Combine `selected_projects` and `schools` by `School.ID` keeping only `School.ID`s present in both datasets. Name the combined data.frame `selected_projects`. Use the newly combined data to determine the percentage of already fully funded projects (`Project.Current.Status`) for schools in West Lafayette, IN. In addition, determine the state (`School.State`) with the highest number of projects. Be sure to specify the number of projects this state has.

 West Lafayette, IN zip codes are 47906 and 47907.
Items to submit
• R code used to solve the question.

• 1-2 sentences answering the percentage of already fully funded projects for schools in West Lafayette, IN, the state with the highest number of projects, and the number of projects this state has.

### Question 5

Using the combined `selected_projects` data, get the school(s) (`School.Name`), city/cities (`School.City`) and state(s) (`School.State`) for the teacher with the highest percentage of fully funded projects (`Project.Current.Status`).

 There are many ways to solve this problem. For example, one option to get the teacher’s ID is to create a variable indicating whether or not the project is fully funded and use `tapply`. Another option is to create `prop.table` and select the corresponding column/row.
 Note that each row in the data corresponds to a unique project ID.
 Once you have the teacher’s ID, consider filtering `projects` to contain only rows for which the corresponding teacher’s ID is in, and only the columns we are interested in: `School.Name`, `School.City`, and `School.State`. Then, you can get the unique values in this shortened data.
 To get only certain columns when subetting, you may find the argument `select` from `subset` useful.
Items to submit
• R code used to solve the question.

• Output of your code containing school(s), city(s) and state(s) of the selected teacher.