Improving Your Data Pipelines: Tidyverse Best Practices for QA
Author: Michael Hallquist
Published: April 16, 2026
Introduction
You’ve all made excellent progress in learning how to wrangle, join, and QA data in R! After reviewing your QA projects, I want to highlight a few common patterns that can make data pipelines brittle, hard to read, or prone to silent errors. These challenges came up in several QA projects, so I thought I would prepare this document for everyone.
In this document, we will cover four next-level tidyverse practices that will make your code more robust and professional:
Wide, repetitive columns vs. pivot_longer()
Treating RMarkdown as a reproducible pipeline, not a scratchpad
Join fidelity and the tidylog package
Using crosswalks (lookup tables) instead of long case_when() and ifelse() chains
1. Wide, Repetitive Columns vs. pivot_longer()
When working with longitudinal data or data from multiple sites (e.g., Wave 1 vs Wave 2, or Site A vs Site B), a common instinct is to create “wide” datasets where every combination of measure and timepoint gets its own dedicated column (e.g., wave1_anxiety, wave2_anxiety).
For example, imagine a dataset where we measure penguin flipper length and body mass at two different time points (baseline and follow-up).
Code
head(wide_penguins)
| id | species | island | sex | baseline_flipper_len | baseline_body_mass | followup_flipper_len | followup_body_mass |
|----|---------|--------|-----|----------------------|--------------------|----------------------|--------------------|
| 1 | Gentoo | Biscoe | female | 184.5048 | 3550.784 | 192.7237 | 3660.734 |
| 2 | Adelie | Biscoe | male | 198.1783 | 3583.527 | 199.5639 | 3715.988 |
| 3 | Adelie | Torgersen | male | 206.5546 | 3876.784 | 213.8792 | 3988.955 |
| 4 | Adelie | Dream | male | 181.0819 | 3202.476 | 189.8613 | 3288.566 |
| 5 | Adelie | Torgersen | female | 218.0126 | 4427.798 | 225.3475 | 4610.878 |
The simpler, but less scalable, approach (repetitive columns): Some projects attempted to write separate lines of code to check for outliers or compute means for baseline_flipper_len and then also for followup_flipper_len. This requires copying and pasting code, which violates the DRY (Don’t Repeat Yourself) principle.
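To see the problem concretely, here is a minimal sketch of the repetitive pattern (using a small stand-in for wide_penguins so the chunk runs on its own):

```r
library(dplyr)

# Small stand-in for wide_penguins: one dedicated column per timepoint
wide_penguins <- tibble(
  id = 1:3,
  baseline_flipper_len = c(184.5, 198.2, 206.6),
  followup_flipper_len = c(192.7, 199.6, 213.9)
)

# One near-identical line per column; every new wave or measure means
# another round of copy, paste, and (hopefully) careful editing
mean(wide_penguins$baseline_flipper_len, na.rm = TRUE)
mean(wide_penguins$followup_flipper_len, na.rm = TRUE)
```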
The tidyverse way (pivot_longer): Instead, pivot the data so that the metadata (the timepoint and the measure) become values in columns.
Code
# Use names_pattern to pull apart the column names:
# "baseline_flipper_len" has three underscore-separated pieces, so a plain
# names_sep = "_" split would not map cleanly onto two names
tidy_penguins <- wide_penguins %>%
  pivot_longer(
    cols = c(contains("baseline"), contains("followup")),
    names_to = c("timepoint", "measure"),
    names_pattern = "([^_]+)_([^_]+).*",
    values_to = "value"
  )

head(tidy_penguins, 10)
| id | species | island | sex | timepoint | measure | value |
|----|---------|--------|-----|-----------|---------|-------|
| 1 | Gentoo | Biscoe | female | baseline | flipper | 184.5048 |
| 1 | Gentoo | Biscoe | female | baseline | body | 3550.7838 |
| 1 | Gentoo | Biscoe | female | followup | flipper | 192.7237 |
| 1 | Gentoo | Biscoe | female | followup | body | 3660.7343 |
| 2 | Adelie | Biscoe | male | baseline | flipper | 198.1783 |
| 2 | Adelie | Biscoe | male | baseline | body | 3583.5275 |
| 2 | Adelie | Biscoe | male | followup | flipper | 199.5639 |
| 2 | Adelie | Biscoe | male | followup | body | 3715.9876 |
| 3 | Adelie | Torgersen | male | baseline | flipper | 206.5546 |
| 3 | Adelie | Torgersen | male | baseline | body | 3876.7836 |
Now, if you want to check for outliers or missingness across all flipper measurements regardless of timepoint, a single filter() plus group_by()/summarize() pipeline covers every wave at once.
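For instance, a minimal sketch (with a small stand-in for tidy_penguins so the chunk runs on its own; the column names match the pivoted data above):

```r
library(dplyr)

# Small stand-in for the tidy_penguins data frame built above
tidy_penguins <- tibble(
  id        = c(1, 1, 2, 2),
  timepoint = c("baseline", "followup", "baseline", "followup"),
  measure   = "flipper",
  value     = c(184.5, 192.7, 198.2, NA)
)

# One pipeline for every flipper measurement, whatever the timepoint
flipper_summary <- tidy_penguins %>%
  filter(measure == "flipper") %>%
  group_by(timepoint) %>%
  summarize(
    n_missing = sum(is.na(value)),
    mean_len  = mean(value, na.rm = TRUE),
    .groups   = "drop"
  )
flipper_summary
```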
2. Treating RMarkdown as a Reproducible Pipeline, Not a Scratchpad
RMarkdown is fantastic for interactive data exploration. However, a common mistake is leaving interim tweaks and fixes in your final compiled document.
For example, imagine you accidentally corrupt a value and then “fix” it several chunks later:
Code
# Chunk 1: Reading data
df <- read_csv("data.csv")

# Chunk 2: Accidental mistake (changing an ID instead of a value)
df[299, "id"] <- 999

# ... 50 lines of other code ...

# Chunk 10: The "Fix"
# Oh wait, row 299 was supposed to have ID 105, not 999
df[299, "id"] <- 105
The Tidyverse Way: A final processing script should reliably produce the same results when you run it from top to bottom. That is, every time you run the script in its entirety, the same things should happen (software developers call such a re-runnable, same-outcome script ‘idempotent’).
Never use raw row indices (df[299, ]) to change data. If the data is sorted differently tomorrow, row 299 will be a completely different person!
If you make a mistake in Chunk 2, delete or edit Chunk 2. Do not append a fix in Chunk 10.
Use conditional logic to fix specific values (e.g., 999 -> 105 only when the record’s date matches) rather than positional indexing.
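A minimal sketch of such a targeted fix (the df columns and the specific date are hypothetical names for illustration):

```r
library(dplyr)

# Hypothetical data: the column names and dates are illustrative only
df <- tibble(
  id         = c(101, 999, 103),
  visit_date = as.Date(c("2024-03-01", "2024-03-02", "2024-03-03"))
)

# Targeted, sort-order-independent fix: 999 -> 105, but only for the
# record whose date matches the visit we meant to correct
df <- df %>%
  mutate(id = if_else(id == 999 & visit_date == as.Date("2024-03-02"), 105, id))
```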
3. Join Fidelity and the tidylog Package
The most dangerous errors in R are the ones that don’t produce an explicit error message that stops your script in its tracks. When joining datasets, a join can silently drop rows (an inner_join with unmatched keys) or explosively duplicate them (any join on non-unique keys).
The Pre-Join Check: Always check if your join keys are unique in the dataset when you expect them to be!
Code
# Are there duplicate IDs in my demographic table before I join?
demo_data %>%
  count(subject_id) %>%
  filter(n > 1)
# If this returns rows, STOP! Your join will duplicate data.
The Post-Join Check (tidylog): Several projects had silent join failures (e.g., merging 100 participants but getting datasets that only had 90 without logging who disappeared or why).
The easiest fix is to load the tidylog package at the top of your script. It automatically intercepts dplyr verbs and prints a summary of its operations.
Code
library(dplyr)
library(tidylog)  # load after dplyr so it can intercept the verbs

# Example with tidylog active
df1 <- tibble(id = 1:5, var_a = letters[1:5])
df2 <- tibble(id = c(1, 2, 4, 6), var_b = LETTERS[1:4])

# Watch the console output below!
merged_df <- df1 %>%
  left_join(df2, by = "id")
left_join: added one column (var_b)
> rows only in x 2
> rows only in df2 (1)
> matched rows 3
> ===
> rows total 5
Notice how tidylog explicitly tells us that 2 rows were missing a match from df2 (var_b will be NA)! This is invaluable for QA.
If you specifically want to see who didn’t match, use anti_join:
Code
# Who is in df1 but absent in df2?
missing_from_df2 <- anti_join(df1, df2, by = "id")
missing_from_df2
| id | var_a |
|----|-------|
| 3 | c |
| 5 | e |
4. Crosswalks Instead of Nested ifelse()
When harmonizing demographic categories across sites or waves, it’s tempting to write massive, nested case_when() or ifelse() statements. This works, but it becomes cumbersome and error-prone as the number of cases grows.
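For contrast, here is a sketch of the nested approach that the crosswalk replaces (recoding the same messy site codes used in the example below):

```r
library(dplyr)

raw_site <- c("1", "A", "site_a", "2", "B")

# Every new messy code means hunting down and editing this block
clean_site <- case_when(
  raw_site %in% c("1", "A", "site_a") ~ "Site A",
  raw_site %in% c("2", "B")           ~ "Site B",
  TRUE                                ~ NA_character_
)
clean_site
```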
The Tidyverse Way (Lookup Tables / Crosswalks): Create a tiny, separate dataframe (a “crosswalk”) that maps the raw values to the clean values, and then simply left_join it. The tibble package has a handy “tribble” function that lets you type out a dataset in the code.
Code
# Create the raw data
messy_sites <- tibble(
  patient  = 1:5,
  raw_site = c(1, "A", "site_a", 2, "B")
)

# 1. Define the crosswalk (this can even be a CSV you load in!)
site_crosswalk <- tribble(
  ~raw_site, ~clean_site,
  "1",       "Site A",
  "A",       "Site A",
  "site_a",  "Site A",
  "2",       "Site B",
  "B",       "Site B"
)

# 2. Harmonize column types (e.g., force to character)
messy_sites <- messy_sites %>%
  mutate(raw_site = as.character(raw_site))
mutate: no changes
Code
# 3. Join! (tidylog will tell us if it worked)
clean_data <- messy_sites %>%
  left_join(site_crosswalk, by = "raw_site")
left_join: added one column (clean_site)
> rows only in x 0
> rows only in site_crosswalk (0)
> matched rows 5
> ===
> rows total 5
Code
clean_data
| patient | raw_site | clean_site |
|---------|----------|------------|
| 1 | 1 | Site A |
| 2 | A | Site A |
| 3 | site_a | Site A |
| 4 | 2 | Site B |
| 5 | B | Site B |
This can significantly simplify your main data script. If a new messy site code appears (like "site_B_new"), you just add one row to your crosswalk table; you don’t have to hunt down and edit a massive case_when() block deep in your code. Note that you can also save your crosswalk table to a file (e.g., CSV) and read that in for the join.
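A minimal sketch of that file-based workflow (the file name is illustrative; a temp directory keeps the example self-contained):

```r
library(readr)
library(tibble)

site_crosswalk <- tribble(
  ~raw_site, ~clean_site,
  "1",       "Site A",
  "A",       "Site A"
)

# Save the crosswalk once (in a real project, keep it under version control)
path <- file.path(tempdir(), "site_crosswalk.csv")
write_csv(site_crosswalk, path)

# In your main script: read it back, forcing both columns to character
site_crosswalk <- read_csv(path, col_types = "cc")
```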