Joins Tutorial

Author

Michael Hallquist, PSYC 859

Published

March 12, 2026

Introduction to joining datasets using dplyr

This document has been adapted and extended by Michael Hallquist and Benjamin Johnson from Jenny Bryan’s dplyr joins tutorial (http://stat545.com/bit001_dplyr-cheatsheet.html). Animations were developed by Garrick Aden-Buie. The goal is to develop an intuition of the four major types of two-table join operations: inner, left, right, and full. We’ll also get into using joins to identify areas of match or mismatch between two datasets (using semi- and anti-joins).

Warning: The `file` argument of `read_csv()` should use `I()` for literal data as of
readr 2.2.0.
  
  # Bad (for example):
  read_csv("x,y\n1,2")
  
  # Good:
  read_csv(I("x,y\n1,2"))

Superheroes table

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

Publishers Table

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

Mutating joins: inner, left, right, full

“Mutating” joins combine variables from two datasets on the basis of one or more keys that match between datasets. In the case of these datasets, notice they share the “publisher” column.

N.B.: By default, dplyr will search for common columns across datasets as the matching keys (natural join). If you want to control the process, specify the key using by.

inner join

Require match in both datasets (non-matching rows are dropped)

For those of you who are visual learners, conceptually, imagine the following two simple datasets:

An inner join combines the two datasets and drops the non-matching rows like so:

Let’s try it with our superhero data.

Code

#*NB*: Here, we retain the joined dataset as ijsp
ijsp <- inner_join(x=superheroes, y=publishers)
print(ijsp)

# A tibble: 6 × 5
  name     alignment gender publisher yr_founded
  <chr>    <chr>     <chr>  <chr>          <dbl>
1 Magneto  bad       male   Marvel          1939
2 Storm    good      female Marvel          1939
3 Mystique bad       female Marvel          1939
4 Batman   good      male   DC              1934
5 Joker    bad       male   DC              1934
6 Catwoman bad       female DC              1934

Same idea, just explicit declaration of key (i.e., “publisher”)

Code

#note that we've cut the x=, y= as this optional
inner_join(superheroes, publishers, by="publisher")

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934

Notice both Hellboy (from the superheroes dataset) and Image comics (from the publishers dataset) were dropped.

left join

Keep all rows in left-hand ‘x’ dataset (i.e., superheroes). Add columns from publishers where there is a match. Fill in NA for non-matching observations.

Code

left_join(superheroes, publishers, by="publisher")

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
Hellboy	good	male	Dark Horse Comics	NA

right join

Keep all rows in right-hand ‘y’ dataset (i.e., publishers). Add columns from superheroes where there is a match. Fill in NA for non-matching observations.

Code

# Note the shift to using dplyr piping
# This achieves the same purpose, but may be preferred by those who love pipes
superheroes %>% right_join(publishers, by="publisher")

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
NA	NA	NA	Image	1992

full join

Keep all rows in left-hand ‘x’ (superheroes) and right-hand ‘y’ (publishers) datasets.

Resulting dataset will have all columns of both datasets, but filling in NA for any non-matches on either side (denoted as blank spaces below).

Code

superheroes %>% full_join(publishers, by="publisher")

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
Hellboy	good	male	Dark Horse Comics	NA
NA	NA	NA	Image	1992

One-to-many join

Note that when there are non-unique matches, the join adds all possible combinations.

This occurs in a one-to-many join, where a single observation in table A relates to multiple observations in table B. For this scenario, the single table A observation will be replicated onto multiple rows to match the multiple observations in table B.

Let’s say you wanted to examine how mood ratings provided over many days relate to demographic characteristics such as age.

person	day	mood
Jeff	1	3
Jeff	2	3
Jeff	3	10
Jeff	4	2
Jeff	5	6
Jeff	6	5

person	bio_sex	age	height_in
Jeff	Male	15	66
Ping	Male	18	71
Karla	Female	19	60

person	bio_sex	age	height_in	day	mood
Jeff	Male	15	66	1	3
Jeff	Male	15	66	2	3
Jeff	Male	15	66	3	10
Jeff	Male	15	66	4	2
Jeff	Male	15	66	5	6
Jeff	Male	15	66	6	5
Ping	Male	18	71	1	4
Ping	Male	18	71	2	6
Ping	Male	18	71	3	9
Ping	Male	18	71	4	10
Ping	Male	18	71	5	5
Ping	Male	18	71	6	3
Karla	Female	19	60	1	9
Karla	Female	19	60	2	9
Karla	Female	19	60	3	9
Karla	Female	19	60	4	3
Karla	Female	19	60	5	8
Karla	Female	19	60	6	10

Notice how the demographic variables are replicated on rows for each day of mood ratings. If both tables have duplicates for the key, joins can produce many-to-many combinations and a large increase in rows.

Filtering joins: semi_join and anti_join

Filtering joins use specific criteria to identify observations (rows) from one table that exist or don’t exist in another table.

These joins are typically used for diagnosing mismatch between two overlapping datasets. That is, filtering joins are most useful for data quality assurance checks.

semi_join

retain observations (rows) in x that match in y

Observations in superheroes that match in publishers

Notice that this is different from the left_join shown above as the data from y is not kept. That is the fundamental difference between ‘mutating joins’ (e.g., left_join) and ‘filtering joins’ (e.g., semi_join).

Code

semi_join(superheroes, publishers, by="publisher")

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC

Observations in publishers that match in superheroes

Code

semi_join(publishers, superheroes, by="publisher")

publisher	yr_founded
DC	1934
Marvel	1939

This can be useful if you have a dataset of your data of interest and another dataset that indicates which of your participants/observations you want to remove or filter out.

anti_join

observations in x that are not matched in y Note that this is similar to setdiff in base R

Observations in superheroes that don’t match in publishers

Code

anti_join(superheroes, publishers, by="publisher")

name	alignment	gender	publisher
Hellboy	good	male	Dark Horse Comics

Observations in publishers that don’t match in superheroes

Code

publishers %>% anti_join(superheroes, by="publisher")

publisher	yr_founded
Image	1992

This can be useful if you are trying to identify extra participants/observations that may have snuck into one dataset (x) or been deleted in another (y).

Joining multiple datasets

Joining can be done repeatedly across multiple datasets. The following code, for instance, joins datasets two at a time from left to right in the list. The result of a two-table join becomes the ‘x’ dataset for the next join of a new dataset ‘y’.

Code

set.seed(123)
df1 <- data.frame(id=1:10, x=rnorm(10), y=runif(10))
df2 <- data.frame(id=1:11, z=rnorm(11), a=runif(11))
df3 <- data.frame(id=2:10, b=rnorm(9), c=runif(9))

print(df1)

   id           x         y
1   1 -0.56047565 0.8895393
2   2 -0.23017749 0.6928034
3   3  1.55870831 0.6405068
4   4  0.07050839 0.9942698
5   5  0.12928774 0.6557058
6   6  1.71506499 0.7085305
7   7  0.46091621 0.5440660
8   8 -1.26506123 0.5941420
9   9 -0.68685285 0.2891597
10 10 -0.44566197 0.1471136

Code

print(df2)

   id          z          a
1   1  1.7869131 0.79892485
2   2  0.4978505 0.12189926
3   3 -1.9666172 0.56094798
4   4  0.7013559 0.20653139
5   5 -0.4727914 0.12753165
6   6 -1.0678237 0.75330786
7   7 -0.2179749 0.89504536
8   8 -1.0260044 0.37446278
9   9 -0.7288912 0.66511519
10 10 -0.6250393 0.09484066
11 11 -1.6866933 0.38396964

Code

print(df3)

  id          b         c
1  2 -0.5996083 0.6680556
2  3 -0.1294107 0.4176468
3  4  0.8867361 0.7881958
4  5 -0.1513960 0.1028646
5  6  0.3297912 0.4348927
6  7 -3.2273228 0.9849570
7  8 -0.7717918 0.8930511
8  9  0.2865486 0.8864691
9 10 -1.2205120 0.1750527

Code

dftemp <- full_join(df1, df2, by="id")
dffinal <- full_join(dftemp, df3, by="id")

#alternative way to combine that avoids temporary variables
Reduce(function(x, y) full_join(x, y, by="id"), list(df1, df2, df3))

id	x	y	z	a	b	c
1	-0.5604756	0.8895393	1.7869131	0.7989248	NA	NA
2	-0.2301775	0.6928034	0.4978505	0.1218993	-0.5996083	0.6680556
3	1.5587083	0.6405068	-1.9666172	0.5609480	-0.1294107	0.4176468
4	0.0705084	0.9942698	0.7013559	0.2065314	0.8867361	0.7881958
5	0.1292877	0.6557058	-0.4727914	0.1275317	-0.1513960	0.1028646
6	1.7150650	0.7085305	-1.0678237	0.7533079	0.3297912	0.4348927
7	0.4609162	0.5440660	-0.2179749	0.8950454	-3.2273228	0.9849570
8	-1.2650612	0.5941420	-1.0260044	0.3744628	-0.7717918	0.8930511
9	-0.6868529	0.2891597	-0.7288912	0.6651152	0.2865486	0.8864691
10	-0.4456620	0.1471136	-0.6250393	0.0948407	-1.2205120	0.1750527
11	NA	NA	-1.6866933	0.3839696	NA	NA

Alternative using pipeline (less extensible)

Code

mergedf <- df1 %>% full_join(df2, by="id") %>% full_join(df3, by="id")