merge
This page is a work in progress. Please proceed at your own risk. |
merge
is a function that can be used to combine data.frames by row or column names.
merge
can replicate the join operations in SQL.
You can run ?merge
in R for more information about the function.
Merging Two Dataframes - Example 1
dat <- read.csv("/class/datamine/data/fars/7581.csv")
state_names <- read.csv("/class/datamine/data/fars/states.csv")
We look at the heads of both data frames.
head(dat)
head(state_names)
The STATE
column of the dat data frame corresponds to the code
column of the state_names
data frame.
Now we merge these two data frames, by corresponding values from this column.
We call resulting data frame mynewDF
.
mynewDF <- merge(dat,state_names,by.x="STATE",by.y="code")
The new column, called state
(not to be confused with STATE
) is the rightmost column in this new data frame.
head(mynewDF)
Now we can use the new dataframe to calulcate fields of interest. In this example we’ll look at the average count of drunk drivers by state.
sort(tapply(mynewDF$DRUNK_DR, mynewDF$state, mean))
Merging Two Dataframes - Example 2
Here is the flight data from 1995.
Notice that, for instance, the locations of the airports are not given.
We only know the airport Origin
and Dest
codes.
myDF <- read.csv("/class/datamine/data/flights/subset/1995.csv")
Here is a listing of the information about the airports themselves:
airportsDF <- read.csv("/class/datamine/data/flights/subset/airports.csv")
We see that the 3-letter codes about the airports are given in the Origin and Dest columns of myDF
.
head(myDF)
It is harder to tell which column in the airportsDF
gives the 3-letter codes, but these are the iata codes.
head(airportsDF)
It is perhaps easier to see this from the tail of airportsDF:
tail(airportsDF)
Now we merge the two data frames, and we display the information about the Origin
airport, by linking the Origin
column of myDF
with the iata
column of airportsDF
:
mynewDF <- merge(myDF, airportsDF, by.x="Origin", by.y="iata")
The resulting data frame has the same size as myDF
:
dim(myDF)
dim(mynewDF)
but now has extra columns, namely, with information about the Origin
airport:
head(mynewDF)
tail(mynewDF)
So now we can do things like calculating a sum
of all Distances
of flights with Origin
in each state
:
sort(tapply( mynewDF$Distance, mynewDF$state, sum ))
Examples
Consider the data.frames, books
and authors
:
books
id title author_id rating
1 1 Harry Potter and the Sorcerers Stone 1 4.47
2 2 Harry Potter and the Chamber of Secrets 1 4.43
3 3 Harry Potter and the Prisoner of Azkaban 1 4.57
4 4 Harry Potter and the Goblet of Fire 1 4.56
5 5 Harry Potter and the Order of the Phoenix 1 4.50
6 6 Harry Potter and the Half Blood Prince 1 4.57
7 7 Harry Potter and the Deathly Hallows 1 4.62
8 8 The Way of Kings 2 4.64
9 9 The Book Thief 3 4.37
10 10 The Eye of the World 4 4.18
authors
id name avg_rating
1 1 J.K. Rowling 4.46
2 2 Brandon Sanderson 4.39
3 3 Markus Zusak 4.34
4 4 Robert Jordan 4.18
5 5 Agatha Christie 4.00
6 6 Alex Kava 4.02
7 7 Nassim Nicholas Taleb 3.99
8 8 Neil Gaiman 4.13
9 9 Stieg Larsson 4.16
10 10 Antoine de Saint-Exupéry 4.30
How do I merge both data.frames based on author_id
column in the books
data and the id
column in the authors
data, keeping only the rows where there is a match?
Click to see solution
# In SQL this is referred to as an INNER JOIN.
merge(books, authors, by.x="author_id", by.y="id", all=F)
author_id id title rating
1 1 1 Harry Potter and the Sorcerers Stone 4.47
2 1 2 Harry Potter and the Chamber of Secrets 4.43
3 1 3 Harry Potter and the Prisoner of Azkaban 4.57
4 1 4 Harry Potter and the Goblet of Fire 4.56
5 1 5 Harry Potter and the Order of the Phoenix 4.50
6 1 6 Harry Potter and the Half Blood Prince 4.57
7 1 7 Harry Potter and the Deathly Hallows 4.62
8 2 8 The Way of Kings 4.64
9 3 9 The Book Thief 4.37
10 4 10 The Eye of the World 4.18
name avg_rating
1 J.K. Rowling 4.46
2 J.K. Rowling 4.46
3 J.K. Rowling 4.46
4 J.K. Rowling 4.46
5 J.K. Rowling 4.46
6 J.K. Rowling 4.46
7 J.K. Rowling 4.46
8 Brandon Sanderson 4.39
9 Markus Zusak 4.34
10 Robert Jordan 4.18
How do I merge both data.frames based on author_id
column in the books
data and the id
column in the authors
data, keeping all rows regardless of whether there is a match?
Click to see solution
merge(books, authors, by.x="author_id", by.y="id", all.y=T)
author_id id title rating
1 1 1 Harry Potter and the Sorcerers Stone 4.47
2 1 2 Harry Potter and the Chamber of Secrets 4.43
3 1 3 Harry Potter and the Prisoner of Azkaban 4.57
4 1 4 Harry Potter and the Goblet of Fire 4.56
5 1 5 Harry Potter and the Order of the Phoenix 4.50
6 1 6 Harry Potter and the Half Blood Prince 4.57
7 1 7 Harry Potter and the Deathly Hallows 4.62
8 2 8 The Way of Kings 4.64
9 3 9 The Book Thief 4.37
10 4 10 The Eye of the World 4.18
11 5 NA <NA> NA
12 6 NA <NA> NA
13 7 NA <NA> NA
14 8 NA <NA> NA
15 9 NA <NA> NA
16 10 NA <NA> NA
name avg_rating
1 J.K. Rowling 4.46
2 J.K. Rowling 4.46
3 J.K. Rowling 4.46
4 J.K. Rowling 4.46
5 J.K. Rowling 4.46
6 J.K. Rowling 4.46
7 J.K. Rowling 4.46
8 Brandon Sanderson 4.39
9 Markus Zusak 4.34
10 Robert Jordan 4.18
11 Agatha Christie 4.00
12 Alex Kava 4.02
13 Nassim Nicholas Taleb 3.99
14 Neil Gaiman 4.13
15 Stieg Larsson 4.16
16 Antoine de Saint-Exupéry 4.30
# or
merge(authors, books, by.x="id", by.y="author_id", all.x=T)
id name avg_rating id.y
1 1 J.K. Rowling 4.46 1
2 1 J.K. Rowling 4.46 2
3 1 J.K. Rowling 4.46 3
4 1 J.K. Rowling 4.46 4
5 1 J.K. Rowling 4.46 5
6 1 J.K. Rowling 4.46 6
7 1 J.K. Rowling 4.46 7
8 2 Brandon Sanderson 4.39 8
9 3 Markus Zusak 4.34 9
10 4 Robert Jordan 4.18 10
11 5 Agatha Christie 4.00 NA
12 6 Alex Kava 4.02 NA
13 7 Nassim Nicholas Taleb 3.99 NA
14 8 Neil Gaiman 4.13 NA
15 9 Stieg Larsson 4.16 NA
16 10 Antoine de Saint-Exupéry 4.30 NA
title rating
1 Harry Potter and the Sorcerers Stone 4.47
2 Harry Potter and the Chamber of Secrets 4.43
3 Harry Potter and the Prisoner of Azkaban 4.57
4 Harry Potter and the Goblet of Fire 4.56
5 Harry Potter and the Order of the Phoenix 4.50
6 Harry Potter and the Half Blood Prince 4.57
7 Harry Potter and the Deathly Hallows 4.62
8 The Way of Kings 4.64
9 The Book Thief 4.37
10 The Eye of the World 4.18
11 <NA> NA
12 <NA> NA
13 <NA> NA
14 <NA> NA
15 <NA> NA
16 <NA> NA