Improving Your Data Pipelines: Tidyverse Best Practices for QA

Author

Michael Hallquist

Published

April 16, 2026

Introduction

You’ve all made excellent progress in learning how to wrangle, join, and QA data in R! After reviewing your QA projects, I want to highlight a few common patterns that can make data pipelines brittle, hard to read, or prone to silent errors. These challenges came up in several QA projects, so I thought I would prepare this document for everyone.

In this document, we will cover four next-level tidyverse practices that will make your code more robust and professional:

  1. Wide, repetitive columns vs. pivot_longer()
  2. Treating RMarkdown as a reproducible pipeline, not a scratchpad
  3. Join fidelity and the tidylog package
  4. Using crosswalks (lookup tables) instead of long case_when() and ifelse() chains

1. Wide, Repetitive Columns vs. pivot_longer()

When working with longitudinal data or data from multiple sites (e.g., Wave 1 vs Wave 2, or Site A vs Site B), a common instinct is to create “wide” datasets where every combination of measure and timepoint gets its own dedicated column (e.g., wave1_anxiety, wave2_anxiety).

For example, imagine a dataset where we measure penguin flipper length and body mass at two different time points (baseline and follow-up).

Code
head(wide_penguins)
id species island sex baseline_flipper_len baseline_body_mass followup_flipper_len followup_body_mass
1 Gentoo Biscoe female 184.5048 3550.784 192.7237 3660.734
2 Adelie Biscoe male 198.1783 3583.527 199.5639 3715.988
3 Adelie Torgersen male 206.5546 3876.784 213.8792 3988.955
4 Adelie Dream male 181.0819 3202.476 189.8613 3288.566
5 Adelie Torgersen female 218.0126 4427.798 225.3475 4610.878

The simpler, but less scalable, approach (repetitive columns): Some projects attempted to write separate lines of code to check for outliers or compute means for baseline_flipper_len and then also for followup_flipper_len. This requires copying and pasting code, which violates the DRY (Don’t Repeat Yourself) principle.
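For instance, the copy-paste version of a simple mean check might look like this (a sketch; the column names come from the example above):

Code
# Repetitive approach: one hand-typed line per column -- easy to typo,
# and every new measure or timepoint means more copied code
wide_penguins %>%
  summarize(
    mean_baseline_flipper = mean(baseline_flipper_len),
    mean_followup_flipper = mean(followup_flipper_len),
    mean_baseline_mass    = mean(baseline_body_mass),
    mean_followup_mass    = mean(followup_body_mass)
  )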

The tidyverse way (pivot_longer): Instead, pivot the data so that the metadata (the timepoint and the measure) become values in columns.

Code
# We use names_pattern to pull apart the column names; the regex keeps the
# first two name pieces (timepoint and measure) and drops the trailing
# suffix ("len"/"mass"). names_sep = "_" would fail here because these
# names split into three pieces, not two.
tidy_penguins <- wide_penguins %>%
  pivot_longer(
    cols = c(contains("baseline"), contains("followup")),
    names_to = c("timepoint", "measure"),
    names_pattern = "([^_]+)_([^_]+).*",
    values_to = "value"
  )

head(tidy_penguins, 10)
id species island sex timepoint measure value
1 Gentoo Biscoe female baseline flipper 184.5048
1 Gentoo Biscoe female baseline body 3550.7838
1 Gentoo Biscoe female followup flipper 192.7237
1 Gentoo Biscoe female followup body 3660.7343
2 Adelie Biscoe male baseline flipper 198.1783
2 Adelie Biscoe male baseline body 3583.5275
2 Adelie Biscoe male followup flipper 199.5639
2 Adelie Biscoe male followup body 3715.9876
3 Adelie Torgersen male baseline flipper 206.5546
3 Adelie Torgersen male baseline body 3876.7836

Now, if you want to check for outliers or missingness across all flipper measurements regardless of timepoint, you can simply do:

Code
tidy_penguins %>%
  filter(measure == "flipper") %>%
  group_by(timepoint) %>%
  summarize(
    mean_len = mean(value),
    min_len = min(value)
  )
timepoint mean_len min_len
baseline 197.6665 181.0819
followup 204.2751 189.8613
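The long format pays off for missingness checks, too. A sketch (assuming the tidy_penguins object created above):

Code
# Count missing values for every measure/timepoint combination at once
tidy_penguins %>%
  group_by(measure, timepoint) %>%
  summarize(n_missing = sum(is.na(value)), .groups = "drop")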

2. RMarkdown is a Pipeline, Not a Scratchpad

RMarkdown is fantastic for interactive data exploration. However, a common mistake is leaving interim tweaks and fixes in your final compiled document.

For example, imagine you accidentally miscode a value and then “fix” it many chunks later:

Code
# Chunk 1: Reading data
df <- read_csv("data.csv")

# Chunk 2: Accidental mistake (changing an ID instead of a value)
df[299, "id"] <- 999 

# ... 50 lines of other code

# Chunk 10: The "Fix"
# Oh wait, row 299 was supposed to have ID 105, not 999
df[299, "id"] <- 105 

The Tidyverse Way: A final processing script should reliably produce the same results every time you run it from top to bottom. Re-running the whole script should never change the outcome (software developers call this property ‘idempotence’).

  • Never use raw row indices (df[299, ]) to change data. If the data is sorted differently tomorrow, row 299 will be a completely different person!
  • If you make a mistake in Chunk 2, delete or edit Chunk 2. Do not append a fix in Chunk 10.
  • Use conditional logic to fix specific targeted values, like this (999 -> 105 if date matches):
Code
df <- df %>%
  mutate(
    id = if_else(id == 999 & original_date == "2026-08-14", 105, id)
  )
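You can also add a quick assertion after a targeted fix to confirm the bad value is really gone (a sketch using base R’s stopifnot(); assumes the columns from the example above):

Code
# Defensive check: after the targeted fix, no id of 999 should remain
stopifnot(!any(df$id == 999, na.rm = TRUE))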

3. Join Fidelity and the tidylog Package

The most dangerous errors in R are the ones that don’t produce an explicit error message that stops your script in its tracks. When joining datasets, a left_join or full_join might silently drop rows or explosively duplicate rows.

The Pre-Join Check: Always check if your join keys are unique in the dataset when you expect them to be!

Code
# Are there duplicate IDs in my demographic table before I join?
demo_data %>%
  count(subject_id) %>%
  filter(n > 1) 
# If this returns rows, STOP! Your join will duplicate data.
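If you would rather have the script halt automatically than rely on eyeballing the console, you can turn this check into an assertion (a sketch using base R’s stopifnot()):

Code
# Abort the pipeline if any subject_id appears more than once
n_dupes <- demo_data %>%
  count(subject_id) %>%
  filter(n > 1) %>%
  nrow()
stopifnot("Duplicate subject_ids in demo_data!" = n_dupes == 0)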

The Post-Join Check (tidylog): Several projects had silent join failures (e.g., merging 100 participants but ending up with only 90, with no record of who disappeared or why).

The easiest fix is to load the tidylog package at the top of your script. It automatically intercepts dplyr verbs and prints a summary of what each operation actually did.

Code
# Example with tidylog active
df1 <- tibble(id = 1:5, var_a = letters[1:5])
df2 <- tibble(id = c(1, 2, 4, 6), var_b = LETTERS[1:4])

# Watch the console output below!
merged_df <- df1 %>% left_join(df2, by = "id")
left_join: added one column (var_b)
           > rows only in x   2
           > rows only in y  (1)
           > matched rows     3
           >                 ===
           > rows total       5

Notice how tidylog explicitly tells us that 2 rows in df1 had no match in df2 (their var_b will be NA)! This is invaluable for QA.

If you specifically want to see who didn’t match, use anti_join:

Code
# Who is in df1 but absent in df2?
missing_from_df2 <- anti_join(df1, df2, by = "id")
missing_from_df2
id var_a
3 c
5 e

4. Crosswalks Instead of Nested ifelse()

When harmonizing demographic categories across sites or waves, it’s tempting to write massive, nested case_when() or ifelse() statements. This works, but it becomes cumbersome and error-prone as the number of cases grows.

Code
# The hard-coded way (works, but doesn't scale)
data <- data %>%
  mutate(
    clean_site = case_when(
      site_id == 1 ~ "Site A",
      site_id == "A" ~ "Site A",
      site_id == "site_a" ~ "Site A",
      site_id == 2 ~ "Site B",
      site_id == "B" ~ "Site B",
      TRUE ~ "Unknown"
    )
  )

This is hard to read and prone to typos.

The Tidyverse Way (Lookup Tables / Crosswalks): Create a tiny, separate dataframe (a “crosswalk”) that maps the raw values to the clean values, and then simply left_join it. The tibble package has a handy tribble() function that lets you type out a small dataset directly in your code.

Code
# Create the raw data
messy_sites <- tibble(patient = 1:5, raw_site = c(1, "A", "site_a", 2, "B"))

# 1. Define the crosswalk (this can even be a CSV you load in!)
site_crosswalk <- tribble(
  ~raw_site, ~clean_site,
  "1",       "Site A",
  "A",       "Site A",
  "site_a",  "Site A",
  "2",       "Site B",
  "B",       "Site B"
)

# 2. Harmonize the key types (force raw_site to character to match the crosswalk)
# Note: c() already coerced raw_site to character, so tidylog reports no changes
messy_sites <- messy_sites %>% mutate(raw_site = as.character(raw_site))
mutate: no changes
Code
# 3. Join! (tidylog will tell us if it worked)
clean_data <- messy_sites %>%
  left_join(site_crosswalk, by = "raw_site")
left_join: added one column (clean_site)
           > rows only in x   0
           > rows only in y  (0)
           > matched rows     5
           >                 ===
           > rows total       5
Code
clean_data
patient raw_site clean_site
1 1 Site A
2 A Site A
3 site_a Site A
4 2 Site B
5 B Site B

This can significantly simplify your main data script. If a new messy site code appears (like "site_B_new"), you just add one row to your crosswalk table; you don’t have to hunt down and edit a massive case_when() block deep in your code. Note that you can also save your crosswalk table to a file (e.g., CSV) and read that in for the join.
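For example, if the crosswalk lives in a CSV alongside your project (the file name here is hypothetical), the join step stays identical:

Code
# Keep the mapping under version control as a plain CSV
site_crosswalk <- read_csv("site_crosswalk.csv", col_types = "cc")
clean_data <- messy_sites %>%
  left_join(site_crosswalk, by = "raw_site")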