QA for Antisaccade Data Using List-Columns

Overview

This notebook provides a tidyverse/purrr-friendly approach to per-run QA and processing in a large multilevel dataset. Here, we have eye-tracking data from an antisaccade task and a list object eyeData that contains metadata, trial-level accuracy and reaction time information, and QA decisions about which trials to drop (e.g., due to blinks). The core idea is to keep each subject-run as one row of a tibble with list-columns, then use purrr to map over those nested data frames. This yields code that is explicit about what happens at each level (trial, run, subject), and it avoids repeated joins and manual loops.

Setup

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(readr)

# Resolve paths relative to this Qmd so site builds are portable.
qmd_dir <- dirname(knitr::current_input())
if (!nzchar(qmd_dir)) qmd_dir <- getwd()
project_dir <- file.path(qmd_dir, "antisaccade")
files_dir <- file.path(qmd_dir, "..", "files")

load(file.path(files_dir, "eyeDataList_14Jul2014.Rdata")) # created by Import_Bars_Scan_Behavior.R
subjInfo <- read_csv(file.path(files_dir, "subj_info_run_level.csv.gz"), show_col_types = FALSE)
aggSubjInfo <- read_csv(file.path(files_dir, "subj_info.csv.gz"), show_col_types = FALSE)

expected_trials_per_run <- 42
rt_bounds_qa <- c(150, 1000)
rt_bounds_keep <- c(100, 1200)

Helper functions

These helpers are used inside purrr workflows. The checkmate checks are intentionally strict so errors surface early and locally rather than downstream in long pipelines.
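To illustrate the fail-fast style with a toy function that is not part of the workflow: a checkmate assertion stops immediately with an informative message as soon as an input violates the contract, so the failure points at the offending call rather than at a downstream step.

```r
library(checkmate)

# Toy function for illustration only: doubles a single whole number.
double_count <- function(n) {
  assert_integerish(n, len = 1, any.missing = FALSE)
  as.integer(n) * 2L
}

double_count(21)    # returns 42L
# double_count("a") # would error immediately: Assertion on 'n' failed
```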

add_id_cols <- function(df, luna_id, run_id) {
  checkmate::assert_data_frame(df, min.rows = 0)
  checkmate::assert_integerish(luna_id, len = 1, any.missing = FALSE)
  checkmate::assert_integerish(run_id, len = 1, any.missing = FALSE)
  if (nrow(df) == 0) {
    df$LunaID <- integer(0)
    df$run <- integer(0)
    return(df)
  }
  df %>% mutate(LunaID = as.integer(luna_id), run = as.integer(run_id))
}

count_valence <- function(df, valence, lat_code = 2) {
  checkmate::assert_data_frame(df, min.rows = 0)
  checkmate::assert_names(names(df), must.include = c("valence", "Define.Lat"))
  checkmate::assert_character(valence, len = 1, any.missing = FALSE)
  checkmate::assert_integerish(lat_code, len = 1, any.missing = FALSE)
  if (nrow(df) == 0) return(0L)
  sum(df$valence == valence & df$Define.Lat == lat_code, na.rm = TRUE)
}

find_duplicate_trials <- function(df) {
  if (!checkmate::test_data_frame(df)) {
    if (checkmate::test_list(df)) {
      df <- tryCatch(
        as.data.frame(df),
        error = function(e) {
          warning("find_duplicate_trials: unable to coerce list to data frame (",
                  conditionMessage(e), "); returning empty tibble.")
          NULL
        }
      )
    } else {
      checkmate::assert_data_frame(df)
    }
  }

  if (is.null(df) || nrow(df) == 0) return(tibble())
  checkmate::assert_names(names(df), must.include = "Trial")
  df %>% count(Trial, name = "n") %>% filter(n > 1)
}

List-column walkthrough (core idea)

The data object eyeData is a list of run-level records. Each element contains data frames such as raw.data, correct, and error. Instead of unnesting immediately, we build a tibble where each row represents one run. Each run has list-columns that point to its nested data frames. This gives you a compact run-level index that you can map over.

Key purrr tools used here:

  • map() creates a list-column by applying a function to each element.
  • map_int() is the same pattern but enforces an integer return value (useful for counts).
  • pmap_dfr() iterates over multiple parallel inputs and row-binds the resulting data frames.

eye_tbl <- tibble(
  LunaID = map_int(eyeData, ~ as.integer(.x$LunaID)),
  run = map_int(eyeData, ~ as.integer(.x$run)),
  raw = map(eyeData, "raw.data"),
  correct = map(eyeData, "correct"),
  error = map(eyeData, "error"),
  drop_trials = map(eyeData, "drop.trials")
)

dplyr::glimpse(eye_tbl)
Rows: 678
Columns: 6
$ LunaID      <int> 10128, 10128, 10128, 10128, 10134, 10134, 10134, 10134, 10…
$ run         <int> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4…
$ raw         <list> [<data.frame[134 x 15]>], [<data.frame[157 x 15]>], [<dat…
$ correct     <list> [<data.frame[33 x 9]>], [<data.frame[27 x 9]>], [<data.fr…
$ error       <list> [<data.frame[1 x 9]>], [<data.frame[3 x 9]>], [<data.fram…
$ drop_trials <list> <13, 16, 17, 20, 24, 26, 29, 39>, <2, 9, 12, 13, 15, 22, …

Data dictionary (nested columns)

The table above has one row per run, and several list-columns containing nested data frames. These are the most relevant columns used in this workflow:

  • LunaID: subject identifier (integer).
  • run: run number within subject (integer).
  • raw: data frame of raw trial-level columns (e.g., timing and trial metadata) for the run.
  • correct: data frame of correct trials for the run; includes columns such as Trial, Latency, valence, and Define.Lat.
  • error: data frame of error trials for the run; includes Trial, valence, and Define.Lat.
  • drop_trials: vector of trial IDs dropped for the run (length used for QA totals).
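To make the shape of this table concrete, here is a self-contained toy version with two fabricated runs (all values are made up for illustration), showing how a single run's nested frame is pulled out of a list-column:

```r
library(tibble)
library(purrr)

# Two fabricated run records mimicking the shape of eyeData.
toy_runs <- list(
  list(LunaID = 1L, run = 1L,
       correct = data.frame(Trial = 1:3, Latency = c(410, 520, 385))),
  list(LunaID = 1L, run = 2L,
       correct = data.frame(Trial = 1:2, Latency = c(450, 600)))
)

toy_tbl <- tibble(
  LunaID  = map_int(toy_runs, "LunaID"),
  run     = map_int(toy_runs, "run"),
  correct = map(toy_runs, "correct")
)

toy_tbl$correct[[1]]           # the nested data frame for the first run
map_int(toy_tbl$correct, nrow) # 3 2: per-run trial counts
```

Indexing with `[[` returns the nested data frame itself, so any run's trials can be inspected without flattening the whole table.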

Why list-columns help here

  • Separation of levels: you keep a clean table of runs without flattening every trial.
  • Targeted QA: you can run different checks at the run level, trial level, or subject level by mapping the right function over the right list-column.
  • Performance: you only compute what each step needs; results aren’t combined into one big data frame until you bind them, which reduces extra work and memory use.

Mapping over list-columns

Run-level integrity checks are a good example. We map across each run to compute counts using map_int(), then compute expected totals in a single transmute() call.

# Multilevel QA: file-level integrity (consistent columns)
raw_name_sets <- eye_tbl %>% transmute(names = map(raw, names))
unique(raw_name_sets$names)
[[1]]
 [1] "Start"      "Target"     "Trial"      "Count"      "Sacc.."    
 [6] "Score."     "Latency"    "Define.Lat" "Acc.Deg."   "Define.Acc"
[11] "Notes"      "Logic"      "X..Correct" "X"          "X.1"       

[[2]]
 [1] "Start"      "Target"     "Trial"      "Count"      "Sacc.."    
 [6] "Score."     "Latency"    "Define.Lat" "Acc.Deg."   "Define.Acc"
[11] "Notes"      "Logic"      "X..Correct" "X"          "X.1"       
[16] "X.2"        "X.3"        "X.4"       

# Multilevel QA: trial-level integrity (no duplicate trials)
dup_trials <- pmap_dfr(
  eye_tbl %>% select(correct, LunaID, run),
  function(correct, LunaID, run) {
    out <- find_duplicate_trials(correct)
    if (nrow(out) == 0) return(tibble())
    out %>% mutate(LunaID = LunaID, run = run, .before = 1)
  }
)

if (nrow(dup_trials) > 0) {
  print(dup_trials)
}

# Multilevel QA: run-level integrity (expected trial counts)
trial_counts <- eye_tbl %>%
  transmute(
    LunaID,
    run,
    n_correct = map_int(correct, nrow),
    n_error = map_int(error, nrow),
    n_drop = map_int(drop_trials, length),
    n_total = n_correct + n_error + n_drop
  )

trial_issues <- trial_counts %>% filter(n_total != expected_trials_per_run)
if (nrow(trial_issues) > 0) {
  print(trial_issues)
}
# A tibble: 412 × 6
   LunaID   run n_correct n_error n_drop n_total
    <int> <int>     <int>   <int>  <int>   <int>
 1  10128     2        27       3     13      43
 2  10153     3        23       2     18      43
 3  10156     1        25      24      2      51
 4  10181     4        38       2      3      43
 5  10406     2        39       2      2      43
 6  10466     1        27       5     12      44
 7  10466     2        26       2     15      43
 8  10466     3        22       8     16      46
 9  10466     4        28       4     12      44
10  10565     1        29      13      6      48
# ℹ 402 more rows

Nest and unnest in tidyverse

List-columns often start from nested data, but they can also be created from flat data using nest(). Two common patterns:

  • nest() groups a flat data frame into list-columns, usually to run per-group models or summaries.
  • unnest() expands a list-column back into a flat, row-wise table for plotting or modeling.

This workflow starts with already-nested data (eyeData), but it uses the same principles. Here are examples of both operations in the tidyverse style:

# Example: nest a flat table into run-level list-columns
# (This is illustrative; the script uses the pre-nested eyeData object.)
# run_tbl <- flat_trials %>%
#   group_by(LunaID, run) %>%
#   nest(trials = c(Trial, Latency, valence, Define.Lat))

# Example: unnest a list-column back to trial-level rows
# trials_long <- run_tbl %>%
#   unnest(trials)

The advantage of the list-column approach is that it keeps run-specific trial data together until you need to flatten it. In this script, pmap_dfr() plays the role of a structured unnest when we want to attach IDs and row-bind per-run frames.
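The round trip can also be demonstrated with a runnable toy example (fabricated trials, not the real eyeData); this uses the nest-by-column-spec form, which is equivalent to the group_by() variant sketched above:

```r
library(dplyr)
library(tidyr)

# Fabricated flat trial table: 2 runs with a few trials each.
flat <- tibble(
  LunaID  = c(1L, 1L, 1L, 1L, 1L),
  run     = c(1L, 1L, 1L, 2L, 2L),
  Trial   = c(1L, 2L, 3L, 1L, 2L),
  Latency = c(410, 520, 385, 450, 600)
)

# nest(): one row per (LunaID, run), trial columns tucked into a list-column.
run_tbl <- flat %>% nest(trials = c(Trial, Latency))

# unnest(): back to the flat, trial-level table.
flat_again <- run_tbl %>% unnest(trials)

nrow(run_tbl)    # 2 (one row per run)
nrow(flat_again) # 5 (trial-level rows restored)
```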

Accuracy computations

We can compute accuracy at the run level by mapping over list-columns and then summarizing at the subject level. This keeps the number of join operations low and makes the role of each level explicit.

# subject 10590 is missing all latencies, drop before accuracy QA
eye_tbl <- eye_tbl %>% filter(LunaID != 10590)

accuracy <- eye_tbl %>%
  transmute(
    LunaID,
    run,
    tot_correct = map_int(correct, nrow),
    tot_error = map_int(error, ~ sum(.x$Define.Lat == 2, na.rm = TRUE)),
    tot_drop = map_int(drop_trials, length),
    rew_correct = map_int(correct, ~ sum(.x$valence == "Rew", na.rm = TRUE)),
    neu_correct = map_int(correct, ~ sum(.x$valence == "Neu", na.rm = TRUE)),
    loss_correct = map_int(correct, ~ sum(.x$valence == "Loss", na.rm = TRUE)),
    rew_error = map_int(error, ~ count_valence(.x, "Rew", lat_code = 2)),
    neu_error = map_int(error, ~ count_valence(.x, "Neu", lat_code = 2)),
    loss_error = map_int(error, ~ count_valence(.x, "Loss", lat_code = 2))
  ) %>%
  mutate(
    tot_acc = tot_correct / (tot_correct + tot_error),
    rew_acc = rew_correct / (rew_correct + rew_error),
    neu_acc = neu_correct / (neu_correct + neu_error),
    loss_acc = loss_correct / (loss_correct + loss_error)
  )

accuracy.sumSessions <- accuracy %>%
  group_by(LunaID) %>%
  summarize(
    across(
      c(
        tot_correct, tot_error, tot_drop,
        rew_correct, rew_error, neu_correct, neu_error, loss_correct, loss_error
      ),
      \(x) sum(x, na.rm = TRUE)
    ),
    .groups = "drop"
  ) %>%
  mutate(
    tot_acc = tot_correct / (tot_correct + tot_error),
    rew_acc = rew_correct / (rew_correct + rew_error),
    neu_acc = neu_correct / (neu_correct + neu_error),
    loss_acc = loss_correct / (loss_correct + loss_error)
  )

# low total accuracy subjects
accuracy.sumSessions %>%
  arrange(tot_acc) %>%
  select(LunaID, tot_acc, rew_acc, neu_acc, loss_acc) %>%
  head()
LunaID tot_acc rew_acc neu_acc loss_acc
10648 0.3097345 0.2631579 0.4210526 0.2432432
10820 0.4601770 0.5000000 0.4615385 0.4210526
10696 0.6000000 0.5961538 0.6200000 0.5849057
10672 0.6173913 0.5500000 0.7142857 0.6000000
10646 0.6466165 0.6904762 0.6875000 0.5581395
10777 0.6600000 0.7027027 0.6206897 0.6470588
# subject 10648 had very low accuracy
eye_tbl <- eye_tbl %>% filter(LunaID != 10648)

# save interim progress
#
# What changes when you run this?
# - Writes barsAccuracy_29Jan2026.RData to disk (overwrites if it exists).
save(file = file.path(project_dir, "barsAccuracy_29Jan2026.RData"), accuracy, accuracy.sumSessions)

Latency analysis

To analyze latencies, we collapse the list-columns into a single trial-level table using pmap_dfr() and our add_id_cols() helper. This flattens the nested frames while carrying the run identifiers into every row, with no separate join step.

# analyze correct trial latencies
correctDF <- pmap_dfr(
  eye_tbl %>% select(correct, LunaID, run),
  function(correct, LunaID, run) add_id_cols(correct, LunaID, run)
)

# questionable RTs for review
low_rts <- correctDF %>%
  filter(Latency < rt_bounds_qa[1]) %>%
  select(LunaID, run, Trial, Latency)

cat("Low RTs (showing first 25 of", nrow(low_rts), "rows):\n\n")
Low RTs (showing first 25 of 68 rows):
low_rts %>%
  slice_head(n = 25) %>%
  knitr::kable()
LunaID run Trial Latency
10152 2 6 133
10152 4 11 117
10152 4 15 133
10173 4 39 100
10565 3 30 83
10568 3 11 117
10572 4 29 67
10616 4 39 117
10623 1 9 100
10623 1 13 67
10623 1 30 67
10623 2 26 117
10623 4 6 133
10625 1 10 100
10637 1 22 83
10646 2 40 100
10654 2 19 133
10654 2 39 100
10662 1 11 83
10662 4 18 100
10671 2 21 100
10672 1 3 100
10672 1 5 117
10672 2 13 117
10672 2 29 133
high_rts <- correctDF %>%
  filter(Latency > rt_bounds_qa[2]) %>%
  select(LunaID, run, Trial, Latency)

cat("High RTs (showing first 25 of", nrow(high_rts), "rows):\n\n")
High RTs (showing first 25 of 105 rows):
high_rts %>%
  slice_head(n = 25) %>%
  knitr::kable()
LunaID run Trial Latency
10153 4 18 1033
10406 1 4 1350
10451 1 39 1067
10565 1 40 1050
10568 2 8 1383
10601 2 40 1150
10616 3 11 1133
10625 1 20 1183
10625 3 9 1367
10635 1 37 1217
10635 1 41 1167
10636 1 28 1183
10637 1 24 1267
10637 2 23 1567
10637 2 29 1317
10637 3 31 1250
10637 4 11 1150
10638 2 24 1383
10646 1 22 1117
10652 1 10 1233
10652 1 11 1333
10652 2 22 1183
10652 4 22 1150
10654 3 41 1217
10662 4 32 1183
questionableRTs <- correctDF %>%
  filter(Latency < rt_bounds_qa[1] | Latency > rt_bounds_qa[2]) %>%
  select(LunaID, run, Trial, Latency)
# What changes when you run this?
# - Writes questionableRTs.csv to disk (overwrites if it exists).
write_csv(questionableRTs, file = file.path(project_dir, "questionableRTs.csv"))

Plotting RT distributions

This section produces a multi-page PDF of boxplots, one page per subset of subjects/runs. We first create a combined LunaID_run label and then split those labels into 10 roughly equal-sized bins using ntile(). That bin (plotsplit) is used to filter the data for each page. The pdf() device stays open while we loop over bins, and each plot(g) call writes a new page to the same PDF. We also compute the global latency range once (lohi) so every page uses consistent y-axis limits, which makes page-to-page comparisons easier.

# rt distributions by subject
# split plots across 10 pages because we have so many units
correctToPlot <- correctDF %>%
  select(LunaID, run, Trial, valence, Latency) %>%
  unite("LunaID_run", LunaID, run) %>%
  mutate(
    LunaID_run = factor(LunaID_run),
    LunaID_run_num = as.numeric(LunaID_run),
    plotsplit = ntile(LunaID_run_num, 10)
  ) %>%
  arrange(LunaID_run)

lohi <- range(correctToPlot$Latency) # keep RT ranges consistent across pages

# What changes when you run this?
# - Writes Antisaccade RTs.pdf to disk (overwrites if it exists).
pdf(file.path(project_dir, "Antisaccade RTs.pdf"), width = 20, height = 50)
for (l in sort(unique(correctToPlot$plotsplit))) {
  g <- ggplot(correctToPlot %>% filter(plotsplit == l), aes(x = LunaID_run, y = Latency, color = valence)) +
    geom_boxplot() +
    coord_flip() +
    theme_bw(base_size = 20) +
    ylim(lohi)
  plot(g)
}
dev.off()
quartz_off_screen 
                2 

Aggregate latency summaries

# for now, restrict latencies to the keep range
goodRTs <- correctDF %>% filter(Latency >= rt_bounds_keep[1] & Latency <= rt_bounds_keep[2])

combo <- left_join(
  goodRTs,
  aggSubjInfo %>% select(LunaID, scanage, group),
  by = "LunaID"
)

# compute average latencies and SDs for each valence
latencyByValence <- goodRTs %>%
  select(LunaID, valence, Latency) %>%
  group_by(LunaID, valence) %>%
  summarize(
    MRT = mean(Latency, na.rm = TRUE),
    MSD = sd(Latency, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_longer(c(MRT, MSD), names_to = "stat", values_to = "value") %>%
  unite(valstat, valence, stat) %>%
  pivot_wider(names_from = valstat, values_from = value) %>%
  mutate(
    Overall_MRT = rowMeans(select(., Loss_MRT, Neu_MRT, Rew_MRT), na.rm = TRUE),
    Overall_MSD = rowMeans(select(., Loss_MSD, Neu_MSD, Rew_MSD), na.rm = TRUE)
  )

head(latencyByValence)
LunaID Rew_MRT Rew_MSD Loss_MRT Loss_MSD Neu_MRT Neu_MSD Overall_MRT Overall_MSD
10128 479.3111 82.36223 499.1081 73.88797 515.7500 81.75936 498.0564 79.33652
10134 467.8409 56.64124 461.8235 45.91937 460.6122 52.47055 463.4256 51.67705
10152 434.2703 123.94032 456.8095 105.41306 493.8286 82.29412 461.6361 103.88250
10153 498.9394 104.49759 505.9167 98.46577 532.1429 130.98649 512.3330 111.31661
10156 447.0588 66.63830 462.9167 69.84488 459.4643 60.88330 456.4799 65.78883
10173 400.3800 40.33224 399.5952 66.31744 417.3542 41.08242 405.7765 49.24403

Summary: list-columns + purrr in this workflow

  • map() builds list-columns of nested data frames (per-run data).
  • map_int() safely computes scalar summaries from each nested table.
  • pmap_dfr() is the workhorse for row-binding nested data while carrying IDs.
  • The approach keeps trial-level data nested until a specific analysis requires flattening.
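The whole pattern condenses into one self-contained toy pipeline (fabricated values and column names, for illustration only):

```r
library(dplyr)
library(purrr)
library(tibble)

# Fabricated nested input: two runs, each with a small correct-trials frame.
runs <- tibble(
  LunaID  = c(1L, 1L),
  run     = c(1L, 2L),
  correct = list(
    data.frame(Trial = 1:3, Latency = c(410, 520, 385)),
    data.frame(Trial = 1:2, Latency = c(450, 600))
  )
)

# map_int(): scalar summaries per nested table, staying at the run level.
counts <- runs %>% mutate(n_correct = map_int(correct, nrow))

# pmap_dfr(): flatten to trial level, carrying IDs into each row.
trials <- pmap_dfr(
  runs %>% select(correct, LunaID, run),
  function(correct, LunaID, run) correct %>% mutate(LunaID = LunaID, run = run)
)

counts$n_correct # 3 2
nrow(trials)     # 5
```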