Improving Your Data Pipelines: Tidyverse Best Practices for QA
Author: Michael Hallquist
Published: April 16, 2026
Introduction
You’ve all made excellent progress in learning how to wrangle, join, and QA data in R! After reviewing your QA projects, I want to highlight a few common patterns that can make data pipelines brittle, hard to read, or prone to silent errors. These challenges came up in several QA projects, so I thought I would prepare this document for everyone.
In this document, we will cover four next-level tidyverse practices that will make your code more robust and professional:
Wide, repetitive columns vs. pivot_longer()
Treating RMarkdown as a reproducible pipeline, not a scratchpad
Join fidelity and the tidylog package
Using crosswalks (lookup tables) instead of long case_when() and ifelse() chains
1. Wide, Repetitive Columns vs. pivot_longer()
When working with longitudinal data or data from multiple sites (e.g., Wave 1 vs Wave 2, or Site A vs Site B), a common instinct is to create “wide” datasets where every combination of measure and timepoint gets its own dedicated column (e.g., wave1_anxiety, wave2_anxiety).
For example, imagine a dataset where we measure penguin flipper length and body mass at two different time points (baseline and follow-up).
Code
head(wide_penguins)
| id | species | island | sex | baseline_flipper_len | baseline_body_mass | followup_flipper_len | followup_body_mass |
|----|---------|--------|-----|----------------------|--------------------|----------------------|--------------------|
| 1 | Gentoo | Biscoe | female | 184.5048 | 3550.784 | 192.7237 | 3660.734 |
| 2 | Adelie | Biscoe | male | 198.1783 | 3583.527 | 199.5639 | 3715.988 |
| 3 | Adelie | Torgersen | male | 206.5546 | 3876.784 | 213.8792 | 3988.955 |
| 4 | Adelie | Dream | male | 181.0819 | 3202.476 | 189.8613 | 3288.566 |
| 5 | Adelie | Torgersen | female | 218.0126 | 4427.798 | 225.3475 | 4610.878 |
The simpler, but less scalable, approach (repetitive columns): Some projects attempted to write separate lines of code to check for outliers or compute means for baseline_flipper_len and then also for followup_flipper_len. This requires copying and pasting code, which violates the DRY (Don’t Repeat Yourself) principle.
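To see the problem concretely, here is a minimal sketch of the repetitive pattern (using a small stand-in for wide_penguins so the chunk runs on its own):

```r
library(dplyr)

# Small stand-in for wide_penguins: one dedicated column per timepoint
wide_penguins <- tibble(
  id = 1:3,
  baseline_flipper_len = c(184.5, 198.2, 206.6),
  followup_flipper_len = c(192.7, 199.6, 213.9)
)

# One near-identical line per column; every new wave or measure means
# another round of copy, paste, and (hopefully) careful editing
mean(wide_penguins$baseline_flipper_len, na.rm = TRUE)
mean(wide_penguins$followup_flipper_len, na.rm = TRUE)
```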
The tidyverse way (pivot_longer): Instead, pivot the data so that the metadata (the timepoint and the measure) become values in columns.
Code
# Use names_pattern to pull apart the column names:
# "baseline_flipper_len" has three underscore-separated pieces, so a plain
# names_sep = "_" split would not map cleanly onto two names
tidy_penguins <- wide_penguins %>%
  pivot_longer(
    cols = c(contains("baseline"), contains("followup")),
    names_to = c("timepoint", "measure"),
    names_pattern = "([^_]+)_([^_]+).*",
    values_to = "value"
  )

head(tidy_penguins, 10)
| id | species | island | sex | timepoint | measure | value |
|----|---------|--------|-----|-----------|---------|-------|
| 1 | Gentoo | Biscoe | female | baseline | flipper | 184.5048 |
| 1 | Gentoo | Biscoe | female | baseline | body | 3550.7838 |
| 1 | Gentoo | Biscoe | female | followup | flipper | 192.7237 |
| 1 | Gentoo | Biscoe | female | followup | body | 3660.7343 |
| 2 | Adelie | Biscoe | male | baseline | flipper | 198.1783 |
| 2 | Adelie | Biscoe | male | baseline | body | 3583.5275 |
| 2 | Adelie | Biscoe | male | followup | flipper | 199.5639 |
| 2 | Adelie | Biscoe | male | followup | body | 3715.9876 |
| 3 | Adelie | Torgersen | male | baseline | flipper | 206.5546 |
| 3 | Adelie | Torgersen | male | baseline | body | 3876.7836 |
Now, if you want to check for outliers or missingness across all flipper measurements regardless of timepoint, a single filter() plus group_by()/summarize() pipeline covers every wave at once.
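For instance, a minimal sketch (with a small stand-in for tidy_penguins so the chunk runs on its own; the column names match the pivoted data above):

```r
library(dplyr)

# Small stand-in for the tidy_penguins data frame built above
tidy_penguins <- tibble(
  id        = c(1, 1, 2, 2),
  timepoint = c("baseline", "followup", "baseline", "followup"),
  measure   = "flipper",
  value     = c(184.5, 192.7, 198.2, NA)
)

# One pipeline for every flipper measurement, whatever the timepoint
flipper_summary <- tidy_penguins %>%
  filter(measure == "flipper") %>%
  group_by(timepoint) %>%
  summarize(
    n_missing = sum(is.na(value)),
    mean_len  = mean(value, na.rm = TRUE),
    .groups   = "drop"
  )
flipper_summary
```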
2. Treating RMarkdown as a Reproducible Pipeline, Not a Scratchpad
RMarkdown is fantastic for interactive data exploration. However, a common mistake is leaving interim tweaks and fixes in your final compiled document.
For example, imagine you accidentally corrupt a value and then “fix” it several chunks later:
Code
# Chunk 1: Reading data
df <- read_csv("data.csv")

# Chunk 2: Accidental mistake (changing an ID instead of a value)
df[299, "id"] <- 999

# ... 50 lines of other code ...

# Chunk 10: The "Fix"
# Oh wait, row 299 was supposed to have ID 105, not 999
df[299, "id"] <- 105
The Tidyverse Way: A final processing script should reliably produce the same results when you run it from top to bottom. That is, every time you run the script in its entirety, the same things should happen (software developers call such a re-runnable, same-outcome script ‘idempotent’).
Never use raw row indices (df[299, ]) to change data. If the data is sorted differently tomorrow, row 299 will be a completely different person!
If you make a mistake in Chunk 2, delete or edit Chunk 2. Do not append a fix in Chunk 10.
Use conditional logic to fix specific values (e.g., 999 -> 105 only when the record’s date matches) rather than positional indexing.
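A minimal sketch of such a targeted fix (the df columns and the specific date are hypothetical names for illustration):

```r
library(dplyr)

# Hypothetical data: the column names and dates are illustrative only
df <- tibble(
  id         = c(101, 999, 103),
  visit_date = as.Date(c("2024-03-01", "2024-03-02", "2024-03-03"))
)

# Targeted, sort-order-independent fix: 999 -> 105, but only for the
# record whose date matches the visit we meant to correct
df <- df %>%
  mutate(id = if_else(id == 999 & visit_date == as.Date("2024-03-02"), 105, id))
```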
3. Join Fidelity and the tidylog Package
The most dangerous errors in R are the ones that don’t produce an explicit error message that stops your script in its tracks. When joining datasets, a join can silently drop rows (an inner_join with unmatched keys) or explosively duplicate them (any join on non-unique keys).
The Pre-Join Check: Always check if your join keys are unique in the dataset when you expect them to be!
Code
# Are there duplicate IDs in my demographic table before I join?
demo_data %>%
  count(subject_id) %>%
  filter(n > 1)
# If this returns rows, STOP! Your join will duplicate data.
The Post-Join Check (tidylog): Several projects had silent join failures (e.g., merging 100 participants but getting datasets that only had 90 without logging who disappeared or why).
The easiest fix is to load the tidylog package at the top of your script. It automatically intercepts dplyr verbs and prints a summary of its operations.
Code
library(dplyr)
library(tidylog)  # load after dplyr so it can intercept the verbs

# Example with tidylog active
df1 <- tibble(id = 1:5, var_a = letters[1:5])
df2 <- tibble(id = c(1, 2, 4, 6), var_b = LETTERS[1:4])

# Watch the console output below!
merged_df <- df1 %>%
  left_join(df2, by = "id")
left_join: added one column (var_b)
> rows only in x 2
> rows only in df2 (1)
> matched rows 3
> ===
> rows total 5
Notice how tidylog explicitly tells us that 2 rows were missing a match from df2 (var_b will be NA)! This is invaluable for QA.
If you specifically want to see who didn’t match, use anti_join:
Code
# Who is in df1 but absent in df2?
missing_from_df2 <- anti_join(df1, df2, by = "id")
missing_from_df2
| id | var_a |
|----|-------|
| 3 | c |
| 5 | e |
4. Crosswalks Instead of Nested ifelse()
When harmonizing demographic categories across sites or waves, it’s tempting to write massive, nested case_when() or ifelse() statements. This works, but it becomes cumbersome and error-prone as the number of cases grows.
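For contrast, here is a sketch of the nested approach that the crosswalk replaces (recoding the same messy site codes used in the example below):

```r
library(dplyr)

raw_site <- c("1", "A", "site_a", "2", "B")

# Every new messy code means hunting down and editing this block
clean_site <- case_when(
  raw_site %in% c("1", "A", "site_a") ~ "Site A",
  raw_site %in% c("2", "B")           ~ "Site B",
  TRUE                                ~ NA_character_
)
clean_site
```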
The Tidyverse Way (Lookup Tables / Crosswalks): Create a tiny, separate dataframe (a “crosswalk”) that maps the raw values to the clean values, and then simply left_join it. The tibble package has a handy “tribble” function that lets you type out a dataset in the code.
Code
# Create the raw data
messy_sites <- tibble(
  patient  = 1:5,
  raw_site = c(1, "A", "site_a", 2, "B")
)

# 1. Define the crosswalk (this can even be a CSV you load in!)
site_crosswalk <- tribble(
  ~raw_site, ~clean_site,
  "1",       "Site A",
  "A",       "Site A",
  "site_a",  "Site A",
  "2",       "Site B",
  "B",       "Site B"
)

# 2. Harmonize column types (e.g., force to character)
messy_sites <- messy_sites %>%
  mutate(raw_site = as.character(raw_site))
mutate: no changes
Code
# 3. Join! (tidylog will tell us if it worked)
clean_data <- messy_sites %>%
  left_join(site_crosswalk, by = "raw_site")
left_join: added one column (clean_site)
> rows only in x 0
> rows only in site_crosswalk (0)
> matched rows 5
> ===
> rows total 5
Code
clean_data
| patient | raw_site | clean_site |
|---------|----------|------------|
| 1 | 1 | Site A |
| 2 | A | Site A |
| 3 | site_a | Site A |
| 4 | 2 | Site B |
| 5 | B | Site B |
This can significantly simplify your main data script. If a new messy site code appears (like "site_B_new"), you just add one row to your crosswalk table; you don’t have to hunt down and edit a massive case_when() block deep in your code. Note that you can also save your crosswalk table to a file (e.g., CSV) and read that in for the join.
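A minimal sketch of that file-based workflow (the file name is illustrative; a temp directory keeps the example self-contained):

```r
library(readr)
library(tibble)

site_crosswalk <- tribble(
  ~raw_site, ~clean_site,
  "1",       "Site A",
  "A",       "Site A"
)

# Save the crosswalk once (in a real project, keep it under version control)
path <- file.path(tempdir(), "site_crosswalk.csv")
write_csv(site_crosswalk, path)

# In your main script: read it back, forcing both columns to character
site_crosswalk <- read_csv(path, col_types = "cc")
```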