TDM 10100: Project 10 — 2022

Motivation: As we have learned functions are foundational to more complex programs and behaviors.
There is an entire programming paradigm based on functions called functional programming.

Context: We will apply functions to entire vectors of data using sapply. We learned how to create functions, and now the next step we will take is to use it on a series of data. sapply is one of the best ways to do this in R.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/okcupid/filtered/users.csv

  • /anvil/projects/tdm/data/okcupid/filtered/questions.csv

Helpful Hint

read.csv() function automatically delineates by a comma`,`
You can use other delimiters by using adding the sep argument
i.e. read.csv(…​sep=';')

Use the readlines(…​,n=x) function to see the first x number of rows to identify what the character that you will use in the sep argument.

Questions

ONE

We want to go ahead and load the datasets into data.frames named users and questions. Take a look at both data.frames and identify what is a part of each of them. What information is in each datatset, and how they are related?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1 or 2 sentences on the datasets.

TWO

Simply put, grep helps us to find a word within a string. In R grep is vectorized and can be applied to an entire vector of strings. We will use it to find the any questions that mention google in the data.frame questions.

  1. What do you notice if you just use the function grep() and create a new variable google and then print that variable?

  2. Now that you know the row number, how can you take a look at the information there?

(Bonus question: can find a shortcut to steps a & b?)

Helpful Hint

grep - grep() is a function in R that is used to search for matches of a pattern within each element of a string.

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)
Insider Information

Just an FYI refresh:

  • is an assignment operator, it assigns values to a variable

  • Functions must be called using the round brackets aka parenthesis ()

  • Square brackets [], are also called extraction operators as they are used to help extract specific elements from a vector or matrix.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

THREE

  1. Using the row from our previous question, which variable does this correspond with in the data.frame users?

  2. Knowing that the two possible answers are "No. Why spoil the mystery?" and "Yes, Knowledge is power!" What percentage of users do NOT google someone before the first date?

Helpful Hint
  • Row 2172 in questions corresponds to column named q170849 in users

  • The table() function can be used to quickly create frequency tables

  • The prop.table() function can calculate the value of each cell in a table as a proportion of all values.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

FOUR

Using the ability to create a function AND tapply find the percentages of Female vs Male (Man vs Woman, as categorized in the users data.frame) who DO google someone before their date.

Helpful Hint
  • tapply() function can be used to apply some function to a vector that has been grouped by another vector. tapply(x, INDEX, FUNCTION)

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

FIVE

Using the ability to create a function AND using sapply() write a function that takes the string and removes everything after/including the _ from the gender_orientation column in the users data.frame. Or it is OK to solve this question as given in the video, without a function and without sapply().

meaning that Hetero_male → Hetero, we want to do this for the entire column gender_orientation

Insider Information

Sapply()- allows you to iterate over a list or vector without the need to use a for loop which is typically a slow way to work in R.

Remember the difference
(a very brief summary of each)

  • A vector is the basic data structure in R they typically are atomic vectors and lists and have three common properties

  • Type- typeof()

  • Length- length()

  • Attributes- attributes() They are different due to the type of elements they hold. All elements in an atomic vector must be the same(they are also always "flat"), but elements of a list can be different types. construction of lists are done by using the function list(). The construction of atomic vectors are done by using the function c(). You can determine specific type by using functions like is.character(), is.double(), is.integer(), is.logical()

  • A matrix is a two-dimensional; rows and columns and all cells must be the same type. Can be created with the function matrix().

  • An array can be one dimension multi-dimensional. An array with one dimension is similar (but not exact) as a vector. An array with two dimensions is similar (but not exact) as a matrix. An array with three or more dimensions is an n-dimensional array. can be created with the function array().

  • A data frame is like a table, or like a matrix, BUT the columns can hold different types of data.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Resources

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.