merge

This page is a work in progress. Please proceed at your own risk.

merge is a function that can be used to combine data.frames by row or column names. merge can replicate the join operations in SQL. You can run ?merge in R for more information about the function.

Merging Two Dataframes - Example 1

dat <- read.csv("/class/datamine/data/fars/7581.csv")
state_names <- read.csv("/class/datamine/data/fars/states.csv")

We look at the heads of both data frames.

head(dat)
head(state_names)

The STATE column of the dat data frame corresponds to the code column of the state_names data frame.

Now we merge these two data frames, by corresponding values from this column.

We call resulting data frame mynewDF.

mynewDF <- merge(dat,state_names,by.x="STATE",by.y="code")

The new column, called state (not to be confused with STATE) is the rightmost column in this new data frame.

head(mynewDF)

Now we can use the new dataframe to calulcate fields of interest. In this example we’ll look at the average count of drunk drivers by state.

sort(tapply(mynewDF$DRUNK_DR, mynewDF$state, mean))

Merging Two Dataframes - Example 2

Here is the flight data from 1995.

Notice that, for instance, the locations of the airports are not given.

We only know the airport Origin and Dest codes.

myDF <- read.csv("/class/datamine/data/flights/subset/1995.csv")

Here is a listing of the information about the airports themselves:

airportsDF <- read.csv("/class/datamine/data/flights/subset/airports.csv")

We see that the 3-letter codes about the airports are given in the Origin and Dest columns of myDF.

head(myDF)

It is harder to tell which column in the airportsDF gives the 3-letter codes, but these are the iata codes.

head(airportsDF)

It is perhaps easier to see this from the tail of airportsDF:

tail(airportsDF)

Now we merge the two data frames, and we display the information about the Origin airport, by linking the Origin column of myDF with the iata column of airportsDF:

mynewDF <- merge(myDF, airportsDF, by.x="Origin", by.y="iata")

The resulting data frame has the same size as myDF:

dim(myDF)
dim(mynewDF)

but now has extra columns, namely, with information about the Origin airport:

head(mynewDF)
tail(mynewDF)

So now we can do things like calculating a sum of all Distances of flights with Origin in each state:

sort(tapply( mynewDF$Distance, mynewDF$state, sum ))

Examples

Consider the data.frames, books and authors:

books
   id                                     title author_id rating
1   1     Harry Potter and the Sorcerers Stone          1   4.47
2   2   Harry Potter and the Chamber of Secrets         1   4.43
3   3  Harry Potter and the Prisoner of Azkaban         1   4.57
4   4       Harry Potter and the Goblet of Fire         1   4.56
5   5 Harry Potter and the Order of the Phoenix         1   4.50
6   6    Harry Potter and the Half Blood Prince         1   4.57
7   7      Harry Potter and the Deathly Hallows         1   4.62
8   8                          The Way of Kings         2   4.64
9   9                            The Book Thief         3   4.37
10 10                      The Eye of the World         4   4.18
authors
   id                     name avg_rating
1   1             J.K. Rowling       4.46
2   2        Brandon Sanderson       4.39
3   3             Markus Zusak       4.34
4   4            Robert Jordan       4.18
5   5          Agatha Christie       4.00
6   6                Alex Kava       4.02
7   7    Nassim Nicholas Taleb       3.99
8   8              Neil Gaiman       4.13
9   9            Stieg Larsson       4.16
10 10 Antoine de Saint-Exupéry       4.30

How do I merge both data.frames based on author_id column in the books data and the id column in the authors data, keeping only the rows where there is a match?

Click to see solution
# In SQL this is referred to as an INNER JOIN.
merge(books, authors, by.x="author_id", by.y="id", all=F)
   author_id id                                     title rating
1          1  1     Harry Potter and the Sorcerers Stone    4.47
2          1  2   Harry Potter and the Chamber of Secrets   4.43
3          1  3  Harry Potter and the Prisoner of Azkaban   4.57
4          1  4       Harry Potter and the Goblet of Fire   4.56
5          1  5 Harry Potter and the Order of the Phoenix   4.50
6          1  6    Harry Potter and the Half Blood Prince   4.57
7          1  7      Harry Potter and the Deathly Hallows   4.62
8          2  8                          The Way of Kings   4.64
9          3  9                            The Book Thief   4.37
10         4 10                      The Eye of the World   4.18
                name avg_rating
1       J.K. Rowling       4.46
2       J.K. Rowling       4.46
3       J.K. Rowling       4.46
4       J.K. Rowling       4.46
5       J.K. Rowling       4.46
6       J.K. Rowling       4.46
7       J.K. Rowling       4.46
8  Brandon Sanderson       4.39
9       Markus Zusak       4.34
10     Robert Jordan       4.18

How do I merge both data.frames based on author_id column in the books data and the id column in the authors data, keeping all rows regardless of whether there is a match?

Click to see solution
merge(books, authors, by.x="author_id", by.y="id", all.y=T)
   author_id id                                     title rating
1          1  1     Harry Potter and the Sorcerers Stone    4.47
2          1  2   Harry Potter and the Chamber of Secrets   4.43
3          1  3  Harry Potter and the Prisoner of Azkaban   4.57
4          1  4       Harry Potter and the Goblet of Fire   4.56
5          1  5 Harry Potter and the Order of the Phoenix   4.50
6          1  6    Harry Potter and the Half Blood Prince   4.57
7          1  7      Harry Potter and the Deathly Hallows   4.62
8          2  8                          The Way of Kings   4.64
9          3  9                            The Book Thief   4.37
10         4 10                      The Eye of the World   4.18
11         5 NA                                      <NA>     NA
12         6 NA                                      <NA>     NA
13         7 NA                                      <NA>     NA
14         8 NA                                      <NA>     NA
15         9 NA                                      <NA>     NA
16        10 NA                                      <NA>     NA
                       name avg_rating
1              J.K. Rowling       4.46
2              J.K. Rowling       4.46
3              J.K. Rowling       4.46
4              J.K. Rowling       4.46
5              J.K. Rowling       4.46
6              J.K. Rowling       4.46
7              J.K. Rowling       4.46
8         Brandon Sanderson       4.39
9              Markus Zusak       4.34
10            Robert Jordan       4.18
11          Agatha Christie       4.00
12                Alex Kava       4.02
13    Nassim Nicholas Taleb       3.99
14              Neil Gaiman       4.13
15            Stieg Larsson       4.16
16 Antoine de Saint-Exupéry       4.30
# or

merge(authors, books, by.x="id", by.y="author_id", all.x=T)
   id                     name avg_rating id.y
1   1             J.K. Rowling       4.46    1
2   1             J.K. Rowling       4.46    2
3   1             J.K. Rowling       4.46    3
4   1             J.K. Rowling       4.46    4
5   1             J.K. Rowling       4.46    5
6   1             J.K. Rowling       4.46    6
7   1             J.K. Rowling       4.46    7
8   2        Brandon Sanderson       4.39    8
9   3             Markus Zusak       4.34    9
10  4            Robert Jordan       4.18   10
11  5          Agatha Christie       4.00   NA
12  6                Alex Kava       4.02   NA
13  7    Nassim Nicholas Taleb       3.99   NA
14  8              Neil Gaiman       4.13   NA
15  9            Stieg Larsson       4.16   NA
16 10 Antoine de Saint-Exupéry       4.30   NA
                                       title rating
1      Harry Potter and the Sorcerers Stone    4.47
2    Harry Potter and the Chamber of Secrets   4.43
3   Harry Potter and the Prisoner of Azkaban   4.57
4        Harry Potter and the Goblet of Fire   4.56
5  Harry Potter and the Order of the Phoenix   4.50
6     Harry Potter and the Half Blood Prince   4.57
7       Harry Potter and the Deathly Hallows   4.62
8                           The Way of Kings   4.64
9                             The Book Thief   4.37
10                      The Eye of the World   4.18
11                                      <NA>     NA
12                                      <NA>     NA
13                                      <NA>     NA
14                                      <NA>     NA
15                                      <NA>     NA
16                                      <NA>     NA