Joins Tutorial

Author

Michael Hallquist, PSYC 859

Published

March 12, 2026

Introduction to joining datasets using dplyr

This document has been adapted and extended by Michael Hallquist and Benjamin Johnson from Jenny Bryan’s dplyr joins tutorial (http://stat545.com/bit001_dplyr-cheatsheet.html). Animations were developed by Garrick Aden-Buie. The goal is to develop an intuition of the four major types of two-table join operations: inner, left, right, and full. We’ll also get into using joins to identify areas of match or mismatch between two datasets (using semi- and anti-joins).

Warning: The `file` argument of `read_csv()` should use `I()` for literal data as of
readr 2.2.0.
  
  # Bad (for example):
  read_csv("x,y\n1,2")
  
  # Good:
  read_csv(I("x,y\n1,2"))

Superheroes table

name alignment gender publisher
Magneto bad male Marvel
Storm good female Marvel
Mystique bad female Marvel
Batman good male DC
Joker bad male DC
Catwoman bad female DC
Hellboy good male Dark Horse Comics

Publishers Table

publisher yr_founded
DC 1934
Marvel 1939
Image 1992

Mutating joins: inner, left, right, full

“Mutating” joins combine variables from two datasets on the basis of one or more keys that match between datasets. In the case of these datasets, notice they share the “publisher” column.

N.B.: By default, dplyr will search for common columns across datasets as the matching keys (natural join). If you want to control the process, specify the key using by.

inner join

Require match in both datasets (non-matching rows are dropped)

For those of you who are visual learners, conceptually, imagine the following two simple datasets:

An inner join combines the two datasets and drops the non-matching rows like so:

Let’s try it with our superhero data.

Code
#*NB*: Here, we retain the joined dataset as ijsp
ijsp <- inner_join(x=superheroes, y=publishers)
print(ijsp)
# A tibble: 6 × 5
  name     alignment gender publisher yr_founded
  <chr>    <chr>     <chr>  <chr>          <dbl>
1 Magneto  bad       male   Marvel          1939
2 Storm    good      female Marvel          1939
3 Mystique bad       female Marvel          1939
4 Batman   good      male   DC              1934
5 Joker    bad       male   DC              1934
6 Catwoman bad       female DC              1934

Same idea, just explicit declaration of key (i.e., “publisher”)

Code
#note that we've cut the x=, y= as this optional
inner_join(superheroes, publishers, by="publisher")
name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934

Notice both Hellboy (from the superheroes dataset) and Image comics (from the publishers dataset) were dropped.

left join

Keep all rows in left-hand ‘x’ dataset (i.e., superheroes). Add columns from publishers where there is a match. Fill in NA for non-matching observations.

Code
left_join(superheroes, publishers, by="publisher")
name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Hellboy good male Dark Horse Comics NA

right join

Keep all rows in right-hand ‘y’ dataset (i.e., publishers). Add columns from superheroes where there is a match. Fill in NA for non-matching observations.

Code
# Note the shift to using dplyr piping
# This achieves the same purpose, but may be preferred by those who love pipes
superheroes %>% right_join(publishers, by="publisher")
name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
NA NA NA Image 1992

full join

Keep all rows in left-hand ‘x’ (superheroes) and right-hand ‘y’ (publishers) datasets.

Resulting dataset will have all columns of both datasets, but filling in NA for any non-matches on either side (denoted as blank spaces below).

Code
superheroes %>% full_join(publishers, by="publisher")
name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Hellboy good male Dark Horse Comics NA
NA NA NA Image 1992

One-to-many join

Note that when there are non-unique matches, the join adds all possible combinations.

This occurs in a one-to-many join, where a single observation in table A relates to multiple observations in table B. For this scenario, the single table A observation will be replicated onto multiple rows to match the multiple observations in table B.

Let’s say you wanted to examine how mood ratings provided over many days relate to demographic characteristics such as age.

person day mood
Jeff 1 3
Jeff 2 3
Jeff 3 10
Jeff 4 2
Jeff 5 6
Jeff 6 5
person bio_sex age height_in
Jeff Male 15 66
Ping Male 18 71
Karla Female 19 60
person bio_sex age height_in day mood
Jeff Male 15 66 1 3
Jeff Male 15 66 2 3
Jeff Male 15 66 3 10
Jeff Male 15 66 4 2
Jeff Male 15 66 5 6
Jeff Male 15 66 6 5
Ping Male 18 71 1 4
Ping Male 18 71 2 6
Ping Male 18 71 3 9
Ping Male 18 71 4 10
Ping Male 18 71 5 5
Ping Male 18 71 6 3
Karla Female 19 60 1 9
Karla Female 19 60 2 9
Karla Female 19 60 3 9
Karla Female 19 60 4 3
Karla Female 19 60 5 8
Karla Female 19 60 6 10

Notice how the demographic variables are replicated on rows for each day of mood ratings. If both tables have duplicates for the key, joins can produce many-to-many combinations and a large increase in rows.

Filtering joins: semi_join and anti_join

Filtering joins use specific criteria to identify observations (rows) from one table that exist or don’t exist in another table.

These joins are typically used for diagnosing mismatch between two overlapping datasets. That is, filtering joins are most useful for data quality assurance checks.

semi_join

retain observations (rows) in x that match in y

Observations in superheroes that match in publishers

Notice that this is different from the left_join shown above as the data from y is not kept. That is the fundamental difference between ‘mutating joins’ (e.g., left_join) and ‘filtering joins’ (e.g., semi_join).

Code
semi_join(superheroes, publishers, by="publisher")
name alignment gender publisher
Magneto bad male Marvel
Storm good female Marvel
Mystique bad female Marvel
Batman good male DC
Joker bad male DC
Catwoman bad female DC

Observations in publishers that match in superheroes

Code
semi_join(publishers, superheroes, by="publisher")
publisher yr_founded
DC 1934
Marvel 1939

This can be useful if you have a dataset of your data of interest and another dataset that indicates which of your participants/observations you want to remove or filter out.

anti_join

observations in x that are not matched in y Note that this is similar to setdiff in base R

Observations in superheroes that don’t match in publishers

Code
anti_join(superheroes, publishers, by="publisher")
name alignment gender publisher
Hellboy good male Dark Horse Comics

Observations in publishers that don’t match in superheroes

Code
publishers %>% anti_join(superheroes, by="publisher")
publisher yr_founded
Image 1992

This can be useful if you are trying to identify extra participants/observations that may have snuck into one dataset (x) or been deleted in another (y).

Joining multiple datasets

Joining can be done repeatedly across multiple datasets. The following code, for instance, joins datasets two at a time from left to right in the list. The result of a two-table join becomes the ‘x’ dataset for the next join of a new dataset ‘y’.

Code
set.seed(123)
df1 <- data.frame(id=1:10, x=rnorm(10), y=runif(10))
df2 <- data.frame(id=1:11, z=rnorm(11), a=runif(11))
df3 <- data.frame(id=2:10, b=rnorm(9), c=runif(9))

print(df1)
   id           x         y
1   1 -0.56047565 0.8895393
2   2 -0.23017749 0.6928034
3   3  1.55870831 0.6405068
4   4  0.07050839 0.9942698
5   5  0.12928774 0.6557058
6   6  1.71506499 0.7085305
7   7  0.46091621 0.5440660
8   8 -1.26506123 0.5941420
9   9 -0.68685285 0.2891597
10 10 -0.44566197 0.1471136
Code
print(df2)
   id          z          a
1   1  1.7869131 0.79892485
2   2  0.4978505 0.12189926
3   3 -1.9666172 0.56094798
4   4  0.7013559 0.20653139
5   5 -0.4727914 0.12753165
6   6 -1.0678237 0.75330786
7   7 -0.2179749 0.89504536
8   8 -1.0260044 0.37446278
9   9 -0.7288912 0.66511519
10 10 -0.6250393 0.09484066
11 11 -1.6866933 0.38396964
Code
print(df3)
  id          b         c
1  2 -0.5996083 0.6680556
2  3 -0.1294107 0.4176468
3  4  0.8867361 0.7881958
4  5 -0.1513960 0.1028646
5  6  0.3297912 0.4348927
6  7 -3.2273228 0.9849570
7  8 -0.7717918 0.8930511
8  9  0.2865486 0.8864691
9 10 -1.2205120 0.1750527
Code
dftemp <- full_join(df1, df2, by="id")
dffinal <- full_join(dftemp, df3, by="id")

#alternative way to combine that avoids temporary variables
Reduce(function(x, y) full_join(x, y, by="id"), list(df1, df2, df3))
id x y z a b c
1 -0.5604756 0.8895393 1.7869131 0.7989248 NA NA
2 -0.2301775 0.6928034 0.4978505 0.1218993 -0.5996083 0.6680556
3 1.5587083 0.6405068 -1.9666172 0.5609480 -0.1294107 0.4176468
4 0.0705084 0.9942698 0.7013559 0.2065314 0.8867361 0.7881958
5 0.1292877 0.6557058 -0.4727914 0.1275317 -0.1513960 0.1028646
6 1.7150650 0.7085305 -1.0678237 0.7533079 0.3297912 0.4348927
7 0.4609162 0.5440660 -0.2179749 0.8950454 -3.2273228 0.9849570
8 -1.2650612 0.5941420 -1.0260044 0.3744628 -0.7717918 0.8930511
9 -0.6868529 0.2891597 -0.7288912 0.6651152 0.2865486 0.8864691
10 -0.4456620 0.1471136 -0.6250393 0.0948407 -1.2205120 0.1750527
11 NA NA -1.6866933 0.3839696 NA NA

Alternative using pipeline (less extensible)

Code
mergedf <- df1 %>% full_join(df2, by="id") %>% full_join(df3, by="id")