# STAT 19000: Project 10 — Fall 2020

Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using `sapply`.

Context: We’ve just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. `sapply` is one of the best ways to do this in R.

Scope: R, sapply, functions

Learning objectives
• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Utilize apply functions in order to solve a data-driven problem.

• Gain proficiency using split, merge, and subset.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/okcupid/filtered`

## Questions

 Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and should not contain huge amounts of printed data. Remember that you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to a loss of points.

### Question 1

Load the following datasets into data.frames named `users` and `questions`, respectively: `/class/datamine/data/okcupid/filtered/users.csv`, `/class/datamine/data/okcupid/filtered/questions.csv`. This is data from users on OkCupid, an online dating app. In your own words, explain what each file contains and how they are related. It's always a good idea to poke around the data to get a better understanding of how things are structured!

 Be careful: just because a file name ends in `.csv` does not mean the file is comma-separated. You can change which separator `read.csv` uses with the `sep` argument. To decide which character to pass to `sep`, use the `readLines` function on the file (say, with `n=10`) to see its first lines.
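
As a minimal sketch of this hint, the code below writes a tiny toy file that uses a semicolon as its separator, peeks at it with `readLines`, and then reads it with `read.csv`. The file contents and the names `tmp`, `first_lines`, and `toy` are made up for illustration; the separator used by the OkCupid files may be different.

```r
# Write a tiny semicolon-separated file to a temporary location (toy data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;age;city", "1;34;Chicago", "2;27;Indianapolis"), tmp)

# Peek at the raw text to see which character separates the fields.
first_lines <- readLines(tmp, n = 2)
print(first_lines)

# Tell read.csv which separator to use.
toy <- read.csv(tmp, sep = ";")
str(toy)
```

If `sep` were left at its default (a comma), each line of this toy file would be read as a single field, which is usually the first sign that the separator is wrong.
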
Items to submit
• R code used to solve the problem.

• 1-2 sentences describing what each file contains and how they are related.

### Question 2

`grep` is an incredibly powerful tool available to us in R. We will learn more about `grep` in the future, but for now, know that a simple application of `grep` is to find a word in a string. In R, `grep` is vectorized and can be applied to an entire vector of strings. Use `grep` to find a question that references "google". What is the question?

 If at first you don’t succeed, run `?grep` and check out the `ignore.case` argument.
 To prepare for Question 3, look at the entire row of the `questions` data frame that contains the question about Google. The first entry in this row identifies the question, and it tells you which column of the `users` data frame you will need in Question 3.
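
The following sketch shows how `grep` behaves on a small vector of strings, with and without `ignore.case`. The vector `phrases` is hypothetical toy data, not text from the dataset.

```r
# Toy vector of strings (hypothetical, not taken from the dataset).
phrases <- c("I like Apples", "bananas are fine", "apple pie is best")

# Case-sensitive search: only the lowercase "apple" matches.
idx_sensitive <- grep("apple", phrases)

# Case-insensitive search: "Apples" matches as well.
idx_insensitive <- grep("apple", phrases, ignore.case = TRUE)

# value = TRUE returns the matching strings instead of their positions.
matches <- grep("apple", phrases, ignore.case = TRUE, value = TRUE)
```

Because `grep` returns positions by default, its result can be used directly as a row index into a data frame to look at the full matching row.
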
Items to submit
• R code used to solve the problem.

• The `text` of the question that references Google.

### Question 3

In (2) we found a pretty interesting question. What is the percentage of users that Google someone before the first date? Does the proportion change by gender (as defined by `gender2`)? How about by `gender_orientation`?

 The two videos posted in Question 2 might help.
 If you look at the column of `users` corresponding to the question identified in (2), you will see that this column of `users` has two possible answers, namely: `"No. Why spoil the mystery?"` and `"Yes. Knowledge is power!"`.
 Use the `tapply` function with three inputs:

• the correct column of `users`,

• either `gender2` or `gender_orientation`, to break up the data into groups,

• and this function: `function(x) {prop.table(table(x, useNA="always"))}`
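
To see what this `tapply` pattern produces, here is a small sketch on toy vectors. The names `answer`, `group`, and `props` and their values are invented for illustration and are not columns of `users`.

```r
# Toy data (hypothetical): an answer vector with one missing value
# and a grouping vector of the same length.
answer <- c("Yes", "No", "Yes", NA, "No", "Yes")
group  <- c("a", "a", "a", "b", "b", "b")

# For each group, tabulate the answers (counting NA as its own category)
# and convert the counts to proportions.
props <- tapply(answer, group, function(x) {prop.table(table(x, useNA = "always"))})

# The result holds one table of proportions per group.
props[["a"]]
props[["b"]]
```

Because the function returns a whole table rather than a single number, `tapply` returns a list-like object with one entry per group, so each group's proportions sum to 1 separately.
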

Items to submit
• R code used to solve this problem.

• The results of running the code.

• Written answers to the questions.

### Question 4

In Project 8, we created a function called `count_words`. Use this function and `sapply` to create a vector which contains the number of words in each row of the column `text` from the `questions` dataframe. Call the new vector `question_length`, and add it as a column to the `questions` dataframe.

```r
count_words <- function(my_text) {
  my_split_text <- unlist(strsplit(my_text, " "))

  return(length(my_split_text[my_split_text != ""]))
}
```
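
As a rough illustration of applying a function to every element of a column with `sapply`, the sketch below uses `count_words` on a toy data frame. The data frame `toy` and the column name `n_words` are hypothetical.

```r
# Same function as above, repeated so that this example runs on its own.
count_words <- function(my_text) {
  my_split_text <- unlist(strsplit(my_text, " "))
  return(length(my_split_text[my_split_text != ""]))
}

# Toy data frame with a single text column (not the real data).
toy <- data.frame(
  text = c("How are you?", "Hello", "Two  spaces here"),
  stringsAsFactors = FALSE
)

# Apply count_words to each element of the column.
# USE.NAMES = FALSE keeps sapply from labeling each result with its input string.
toy$n_words <- sapply(toy$text, count_words, USE.NAMES = FALSE)
toy
```

The third string contains a double space, which `strsplit` turns into an empty string; the `!= ""` filter in `count_words` is what keeps that empty piece from being counted as a word.
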
Items to submit
• R code used to solve this problem.

• The result of `str(questions)` (this shows how your `questions` data frame looks, after adding the new column called `question_length`).

### Question 5

Consider this function called `number_of_options` that accepts a data frame (for instance, `questions`)…​

```r
number_of_options <- function(myDF) {
  table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))}))
}
```

…​and counts the number of questions that have each possible number of responses. For instance, if we calculate `number_of_options(questions)` we get:

```
  0   2   3   4 
590 936 519 746 
```

which means that: 590 questions have 0 possible responses; 936 questions have 2 possible responses; 519 questions have 3 possible responses; and 746 questions have 4 possible responses.
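
To make the row-wise counting step more concrete, here is a small sketch on a toy data frame. It assumes, as the function does, that columns 3 through 6 hold the possible responses and that an empty string means "no option"; the data frame `toy` and its column names are invented.

```r
# Same function as above, repeated so that this example runs on its own.
number_of_options <- function(myDF) {
  table(apply(as.matrix(myDF[, 3:6]), 1, function(x) {sum(!(x == ""))}))
}

# Toy data frame (hypothetical): columns 3 to 6 play the role of the options.
toy <- data.frame(
  id   = c("a", "b", "c"),
  text = c("t1", "t2", "t3"),
  opt1 = c("Yes", "Red", ""),
  opt2 = c("No", "Green", ""),
  opt3 = c("", "Blue", ""),
  opt4 = c("", "", ""),
  stringsAsFactors = FALSE
)

# Step 1: for each row, count the non-empty option columns.
per_row <- apply(as.matrix(toy[, 3:6]), 1, function(x) sum(x != ""))

# Step 2: tabulate how many rows have each count.
counts <- number_of_options(toy)
counts
```

Here the three rows have 2, 3, and 0 non-empty options, so the table reports one row for each of those counts.
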

Now use the `split` function to break the data frame `questions` into 7 smaller data frames, according to the value in `questions$Keywords`. Then use the `sapply` function to determine, for each possible value of `questions$Keywords`, the analogous breakdown of questions with different numbers of responses, as we did above.

 You can write:

```r
mylist <- split(questions, questions$Keywords)
sapply(mylist, number_of_options)
```
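
The split-then-apply pattern can be sketched on a toy data frame as follows. The data frame `toy` and its columns `topic` and `score` are hypothetical stand-ins, not part of the dataset.

```r
# Toy data frame (hypothetical) with a grouping column and a numeric column.
toy <- data.frame(
  topic = c("x", "y", "x", "y", "x"),
  score = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# split returns a named list with one smaller data frame per value of topic.
pieces <- split(toy, toy$topic)
names(pieces)

# sapply then calls the function once per piece.
rows_per_topic <- sapply(pieces, nrow)
mean_per_topic <- sapply(pieces, function(d) mean(d$score))
```

When the function returns a single value for every piece, `sapply` simplifies the result to a named vector. When the results have different lengths, `sapply` returns a list instead.
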

The way `sapply` works is that the first argument is the list or vector to loop over, and each of its elements is passed, in turn, as the first argument to your function. The second argument is the function you want applied. After that, you can specify additional arguments to your function by name. For example:

```r
test1 <- c(1, 2, 3, 4, NA, 5)
test2 <- c(9, 8, 6, 5, 4, NA)
mylist <- list(first=test1, second=test2)
# for a single vector in the list
mean(mylist$first, na.rm=TRUE)
# what if we want to do this for each vector in the list?
# without na.rm, each result is NA because of the missing values
sapply(mylist, mean)
# we can specify the arguments that are for the mean function
# by naming them after the first two arguments, like this
sapply(mylist, mean, na.rm=TRUE)
# in the code shown above, na.rm=TRUE is passed to the mean function
# just like if you run the following
mean(mylist$first, na.rm=TRUE)
mean(mylist$second, na.rm=TRUE)
# you can include as many arguments to mean as you normally would
# and in any order. just make sure to name the arguments
sapply(mylist, mean, na.rm=TRUE, trim=0.5)
# or sapply(mylist, mean, trim=0.5, na.rm=TRUE)
# which is similar to
mean(mylist$first, na.rm=TRUE, trim=0.5)
mean(mylist$second, na.rm=TRUE, trim=0.5)
```
Items to submit
• R code used to solve this problem.

• The results of running the code.

### Question 6

Lots of questions are asked in this `okcupid` dataset. Explore the dataset, and either calculate an interesting statistic/result using `sapply`, or generate a graphic (with good x-axis and/or y-axis labels, main labels, legends, etc.), or both! Write 1-2 sentences about your analysis and/or graphic, and explain what you thought you’d find, and what you actually discovered.

Items to submit
• R code used to solve this problem.

• The results from running your code.

• 1-2 sentences about your analysis and/or graphic, and explain what you thought you’d find, and what you actually discovered.

### OPTIONAL QUESTION

Does it appear that there is an association between the length of the question and whether or not users answered the question? Assume NA means "unanswered". First create a function called `percent_answered` that, given a vector, returns the percentage of values that are not NA. Use `percent_answered` and `sapply` to calculate the percentage of users who answer each question. Plot this result, against the length of the questions.

 `length_of_questions <- questions$question_length[grep("^q", questions$X)]`
 `grep("^q", questions$X)` returns the index of every entry of `questions$X` that starts with "q". Use the same trick (for example, on the column names of `users`) to subset the `users` data.frame before using `sapply` to apply `percent_answered`.
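
As a hedged sketch of the subsetting trick in this hint, the code below selects the columns of a toy data frame whose names start with "q" and applies a function to each of them. The data frame `toy`, its column names, and the NA-counting function are invented for illustration; the function is deliberately not `percent_answered`.

```r
# Toy data frame (hypothetical): two "question" columns and one other column.
toy <- data.frame(
  q1    = c("a", NA, "b"),
  d_age = c(30, 41, 25),
  q2    = c(NA, NA, "c"),
  stringsAsFactors = FALSE
)

# Positions of the columns whose names start with "q".
q_cols <- grep("^q", names(toy))

# Apply a function to each selected column; here, count the missing values.
na_counts <- sapply(toy[, q_cols], function(x) sum(is.na(x)))
na_counts
```

Because a data frame is a list of columns, `sapply` visits one column at a time and returns a named vector with one value per selected column.
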
Items to submit
• R code used to solve this problem.

• The plot.

• Whether or not you think there is an association between question length and whether the question is answered.