Tidy data overview

Author

Michael Hallquist, PSYC 859

Published

March 12, 2026

Data structure semantics

Most data wrangling can be accomplished using data.frame object (or tbl objects in dplyr). These objects consist of rows and columns, forming a rectangular structure.

Code

gapminder %>% kable_table(n=6)

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1972	36.088	13079460	739.9811
Afghanistan	Asia	1977	38.438	14880372	786.1134

Variables

A variable contains all values measuring an attribute (e.g., neuroticism) across units (e.g., people).

Columns in data.frames are typically labeled and represent variables.

Moreover, in a data.frame, all values in a given column should have the same data type, such as character strings. But a data.frame is different from a matrix object because columns can differ in terms of data type. For example, in the gapminder, the country column is a factor, whereas lifeExp is a numeric column.

“A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.” -Garrett Grolemund Citation

Observations

An observation contains all values measured on the same unit across attributes.

Rows in a data.frame typically represent observations.

Code

gapminder %>% kable_table(n=1)

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453

Tidy data

Features of tidy data:

Each variable forms a column
Each observation forms a row
Each type of observational unit (e.g., persons, schools, counties) forms a table.

Furthermore, variables that are part of the design (participant number, experimental condition, county ID, etc.), or that may be key categorical moderators, should typically be placed first (columns to the left), and measured variables thereafter (columns to the right).

This sounds easy, right?! But in truth, we have all seen data that look more like this:

0.6226  WARNING     Movie2 stim could not be imported and won't be available
17.6226     DATA    version J
102.5349    DATA    Keypress: space
110.4876    DATA    Keypress: v
110.6203    DATA    Keypress: v
111.0859    DATA    Keypress: v
112.6459    DATA    Keypress: t
116.4982    DATA    Keypress: v
118.5406    DATA    Keypress: v

For a more detailed treatment of tidy data, see: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Tidying verbs

The tidyr package provides four core functions to aid in converting messy data into tidy form. We may also need functions from dplyr at times. Each of these verbs is also a function that transforms the dataset with the goal of making it more tidy.

Pivot_longer: combine multiple columns into a single column with a key-value pair format
Pivot_wider: divide key-value rows into multiple columns
Separate: split a single variable into multiple variables by pulling apart the values into pieces
Unite: merge two variables (columns) into one, effectively pasting together the values

Note: pivot_longer and pivot_wider are complements. And separate and unite are complements.

Details about the pivot functions can be found here: https://tidyr.tidyverse.org/articles/pivot.html.

Let’s look at a series of datasets (from Wickham 2014) and consider how tidy or messy they are.

Data tidying examples

pivot_longer example

Here is our first mess. Notice that the column headers are values, not variable names. This is untidy and hard to look at. We effectively have the data in a cross-tabulated format, but religion and income are not variables in the dataset.

Messy version

religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k
Agnostic	27	34	60	81	76	137
Atheist	12	27	37	52	35	70
Buddhist	27	21	30	34	33	58
Catholic	418	617	732	670	638	1116
Don’t know/refused	15	14	15	11	10	35
Evangelical Prot	575	869	1064	982	881	1486
Hindu	1	9	7	9	11	34
Historically Black Prot	228	244	236	238	197	223
Jehovah's Witness	20	27	24	24	21	30
Jewish	19	19	25	25	30	95

Tidy version

In the tidy version, religion and income become variables, and the number of observations in each religion x income combination is a frequency column. This is now tidy insofar as each value in the frequency column represents a unique combination of the religion and income factors, which are coded as variables.

religion	income	freq
Agnostic	$10-20k	34
Agnostic	$100-150k	109
Agnostic	$20-30k	60
Agnostic	$30-40k	81
Agnostic	$40-50k	76
Agnostic	$50-75k	137
Agnostic	$75-100k	122
Agnostic	<$10k	27
Agnostic	>150k	84
Agnostic	Don't know/refused	96

Tidying solution

To achieve the above transformation, we want to pivot_longer the many columns of income into a single income column.

tidy1 <- mess1 %>% pivot_longer(cols=-religion, names_to="income", values_to="freq")

Here, we tell tidyr that we wish to create a lookup (‘key’) column called income whose correponding values will be called freq (here, representing the frequency of this religion x income combination). Furthermore, as additional arguments to pivot_longer, we provide the columns (cols argument) that should be combined, representing levels of the key variable. By specifying -religion, we are saying ‘all columns except religion.’ The alternative would be to provide a comma-separate list of columns like this mess1 %>% pivot_longer(names_to="income", values_to="freq", cols=c("<$10k", "$10-20k", etc.))

pivot_wider example

In our second mess, we have a weather dataset from the Global Historical Climatology Network for one weather station (MX17004) in Mexico. The data represent minimum and maximum temperatures measured across 31 days for five months. The days within each month are on the columns, the months are encoded as a variable month, and the min and max temperatures are separated by row, as identified by the element variable.

Messy version

Code

#show a subset of columns that fit on the page
mess2 %>% dplyr::select(id:d13) %>% kable_table(n=8)

id	year	month	element	d1	d2	d3	d4	d5	d6	d7	d8	d9	d10	d11	d12	d13
MX17004	2010	1	tmax	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
MX17004	2010	1	tmin	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
MX17004	2010	2	tmax	NA	27.3	24.1	NA	NA	NA	NA	NA	NA	NA	29.7	NA	NA
MX17004	2010	2	tmin	NA	14.4	14.4	NA	NA	NA	NA	NA	NA	NA	13.4	NA	NA
MX17004	2010	3	tmax	NA	NA	NA	NA	32.1	NA	NA	NA	NA	34.5	NA	NA	NA
MX17004	2010	3	tmin	NA	NA	NA	NA	14.2	NA	NA	NA	NA	16.8	NA	NA	NA
MX17004	2010	4	tmax	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
MX17004	2010	4	tmin	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Tidy version

id	year	month	day	tmax	tmin
MX17004	2010	1	30	27.8	14.5
MX17004	2010	2	2	27.3	14.4
MX17004	2010	2	3	24.1	14.4
MX17004	2010	2	11	29.7	13.4
MX17004	2010	2	23	29.9	10.7
MX17004	2010	3	5	32.1	14.2
MX17004	2010	3	10	34.5	16.8
MX17004	2010	3	16	31.1	17.6

Tidying solution

To clean this up, we need to bring all of the day columns together using pivot_longer so that we can encode day as a variable and temperature as a variable.

We also may want to have max and min temperature as separate columns (i.e., variables), rather than keeping that as a key-value pair. That is, the tmin and tmax values denote the attributes of a single observation, which would usually be represented as separate variables in tidy format. To obtain min and max temperatures as separate columns, we use pivot_wider to move the element values onto separate columns.

Here is the basic approach:

#use num_range() to select variables called d1--d31
tidy2 <- mess2 %>% 
  pivot_longer(
    cols = num_range("d", 1:31), 
    names_to = "day", 
    values_to = "temperature",
    names_prefix = "d", #trim off the 'd'
    names_transform = list(day = as.integer)
  ) %>%
  pivot_wider(names_from = "element", values_from = "temperature") %>%
  na.omit()

Notice that pivot_longer has a few built-in arguments for helping us trim off parts of the column names that are not data per se. Here, we have d1–d31, but only the numeric part of that is data. The names_prefix="d" tells pivot_longer to trim the leading ‘d’ from every value in the day column. The names_transform=list(day=as.integer) argument then converts the resulting day values to integers so they behave like a numeric variable rather than text.

Here, pivot_wider a key – element – that has values 'tmin' or 'tmax' and puts the values of these rows onto columns. This is a kind of ‘long-to-wide’ conversion and we would expect here for the number of rows in the dataset drop two-fold with the pivot_wider compared to the preceding step where we’ve gathered the day columns.

It is often useful to check the number of rows after each step in a data transformation pipeline. Here, I just break up the pipeline into the pivot_longer and pivot_wider steps and check the structure in between.

Code

tidy2.1 <- mess2 %>% 
  pivot_longer(
    cols = num_range("d", 1:31), names_to = "day", values_to = "temperature",
    names_prefix = "d", names_transform = list(day = as.integer)
  )

nrow(tidy2.1)

[1] 682

Code

tidy2.2 <- tidy2.1 %>%
  pivot_wider(names_from = "element", values_from = "temperature")

# with NAs included (since data are sparse), we get the expected 50% reduction in rows
nrow(tidy2.2)

[1] 341

Code

# there are only 33 useful/present observations
nrow(tidy2.2 %>% na.omit)

[1] 33

Missingness: implicit vs explicit

Tidy data often needs missing values made explicit (or dropped) so analyses behave as intended.

Code

tidy_missing <- tibble::tribble(
  ~id, ~day, ~score,
  1, "Mon", 10,
  1, "Wed", 8,
  2, "Mon", 9,
  3, "Mon", NA_real_
)

# Drop rows where all pivoted values are NA
tidy_missing %>%
  tidyr::pivot_wider(names_from = "day", values_from = "score") %>%
  dplyr::filter(!dplyr::if_all(-id, is.na))

id	Mon	Wed
1	10	8
2	9	NA

Code

# Make implicit missing combinations explicit
tidy_missing %>%
  tidyr::complete(id, day)

id	day	score
1	Mon	10
1	Wed	8
2	Mon	9
2	Wed	NA
3	Mon	NA
3	Wed	NA

Code

# Fill down within a group (e.g., carry forward metadata). Grouping ensures we
# only fill within each id; ungroup to avoid carrying this into later steps.
tidy_missing %>%
  tidyr::complete(id, day) %>%
  dplyr::group_by(id) %>%
  tidyr::fill(score, .direction = "down") %>%
  dplyr::ungroup()

id	day	score
1	Mon	10
1	Wed	8
2	Mon	9
2	Wed	9
3	Mon	NA
3	Wed	NA

separate example

In our third mess, we have multiple variables stored in one column. More specifically, in these data, the ‘m014’ etc. columns represent a combination of sex (m/f) and age range (e.g., 014 is 0–14). The country and year columns are ‘tidy’ because they represent variables, but the sex + age columns are not.

Messy version

Code

#use select to select a few columns that can fit on the page
mess3 %>% dplyr::select(country:f1524) %>% kable_table(n=5)

country	year	m014	m1524	m2534	m3544	m4554	m5564	m65	mu	f014	f1524
AD	2000	0	0	1	0	0	0	0	NA	NA	NA
AE	2000	2	4	4	6	5	12	10	NA	3	16
AF	2000	52	228	183	149	129	94	80	NA	93	414
AG	2000	0	0	0	0	0	0	1	NA	1	1
AL	2000	2	19	21	14	24	19	16	NA	3	11

Tidy version

country	year	sex	age_range	freq
AD	2000	m	0-14	0
AE	2000	m	0-14	2
AF	2000	m	0-14	52
AG	2000	m	0-14	0
AL	2000	m	0-14	2
AM	2000	m	0-14	2
AN	2000	m	0-14	0
AO	2000	m	0-14	186

Tidying solution

We essentially need to parse apart the ‘m’ from the ‘014’ components of each value, which is a job for separate. Note that we also need to pivot_longer the wacky sex + age columns first to make this easier. Here I use cols=c(-country, -year) to say, ‘all columns except these.’

The sep argument of separate tells R how to split the values into multiple variables. Here, by using the number 1, we ask for the first character to become sex and the rest to become age_range.

Code

tidy3 <- mess3 %>% 
  pivot_longer(names_to="sex_age", values_to="freq", cols=c(-country, -year)) %>%
  separate(sex_age, into=c("sex", "age_range"), sep=1)

Here, we gathered all columns except country and year into a single key-value pair using pivot_longer. This is an intermediate stage of the dataset that is semi-tidy. We then separate the sex and age components of the values into different variables, resulting in a tidy dataset.

country	year	sex	age_range	freq
AD	2000	m	014	0
AD	2000	m	1524	0
AD	2000	m	2534	1
AD	2000	m	3544	0
AD	2000	m	4554	0
AD	2000	m	5564	0
AD	2000	m	65	0
AD	2000	m	u	NA

This is pretty close. The age_range variable is still a little clunky because it isn’t easy to read. We could modify this further using mutate and recode from dplyr, but that’s not the immediate emphasis here.

Code

tidy3 <- tidy3 %>% mutate(age_range=recode(age_range,
                                           "014"="0-14",
                                           "1524"="15-24",
                                           "2534"="25-34",
                                           "3544"="35-44",
                                           "4554"="45-54",
                                           "5564"="55-64",
                                           "65"="65+",
                                           "u"="unknown", .default=NA_character_
))

Modern `separate_*` alternatives

For simple string splitting, tidyr now provides separate_wider_delim() (split on a delimiter) and separate_wider_regex() (split using a regex). These are often clearer than separate() because they create named columns directly.

Code

toy_people <- tibble::tibble(
  person = c("Ada-Lovelace", "Grace-Hopper", "Katherine-Johnson")
)

toy_people %>% separate_wider_delim(person, delim = "-", names = c("first", "last"))

first	last
Ada	Lovelace
Grace	Hopper
Katherine	Johnson

Code

toy_people %>% separate_wider_regex(person, patterns = c(first = "^[^-]+", "-",
                                                         last = "[^-]+$"
))

first	last
Ada	Lovelace
Grace	Hopper
Katherine	Johnson

Tidying solution using pivot_longer alone

I am ambivalent about whether it is useful to combine separable objectives into a single data wrangling verb. Nevertheless, I want to note that the pivot_longer function provides added functionality for both combining columns into a key-value pair format and splitting the key into multiple variables if the key variable is an amalgamation of discrete variables. This can allow us to skip the separate step:

Code

tidy3 <- mess3 %>% 
  pivot_longer(
    cols = c(-country, -year), 
    names_to = c("sex", "age_range"),
    names_pattern = "(.)(.*)", #first character versus the rest
    values_to = "freq",
    names_ptypes = list(
      age_range=factor(
        levels=c("u", "014", "1524", "2534", "3544", "4554", "5564", "65"),
        ordered=TRUE
      )
    )
  )

#Note: we can't adjust the labels in the names_ptype above using labels=c(...).
#Thus, we'd need to use recode_factor, similar to the above
tidy3 <- tidy3 %>% mutate(
  age_range=recode_factor(age_range,
                          "014"="0-14",
                          "1524"="15-24",
                          "2534"="25-34",
                          "3544"="35-44",
                          "4554"="45-54",
                          "5564"="55-64",
                          "65"="65+",
                          "u"="unknown", .default=NA_character_
  ))

tidy3 %>% kable_table(n=8)

country	year	sex	age_range	freq
AD	2000	m	0-14	0
AD	2000	m	15-24	0
AD	2000	m	25-34	1
AD	2000	m	35-44	0
AD	2000	m	45-54	0
AD	2000	m	55-64	0
AD	2000	m	65+	0
AD	2000	m	unknown	NA

unite example

Although the least common of the tidying verbs (in my experience), unite is the complement to separate and can be used to bring together multiple variables that we wish to store as a single variable. For example, we may have first name and last name stored in separate variables, but wish to put them together for display or exporting purposes. Sometimes, we also use unite as an intermediate stage in tidying, bringing together variables, reshaping the data, then re-separating them.

first_name	last_name	age	favorite_color
Graham	Doe	11	Purple
Kieran	Helali	9	Blue
Charlotte	Stafford	11	Pink

Tidying solution

If we wanted to have a full_name, we could use unite to combine first_name and last_name and then get rid of those individual columns.

Code

df4_united <- df4 %>% unite(col = "full_name", first_name, last_name, sep=" ")

full_name	age	favorite_color
Graham Doe	11	Purple
Kieran Helali	9	Blue
Charlotte Stafford	11	Pink

Advanced reshaping

data.table melt/dcast

The data.table package provides melt() and dcast() for fast reshaping. The formula interface is compact for multi-way summary tables, and dcast() can aggregate on the fly while casting.

Code

dt <- data.table::data.table(
  id = 1:4,
  group = c("A", "A", "B", "B"),
  y2022 = c(10, 12, 9, 11),
  y2023 = c(13, 14, 10, 12),
  z2022 = c(100, 120, 90, 110),
  z2023 = c(130, 140, 95, 115)
)

long_dt <- data.table::melt(
  dt,
  id.vars = c("id", "group"),
  measure.vars = patterns("^y", "^z"),
  variable.name = "year",
  value.name = c("y", "z")
)

# Map pattern indices to actual year labels.
long_dt$year <- c("2022", "2023")[as.integer(long_dt$year)]
long_dt

id	group	year	y	z
1	A	2022	10	100
2	A	2022	12	120
3	B	2022	9	90
4	B	2022	11	110
1	A	2023	13	130
2	A	2023	14	140
3	B	2023	10	95
4	B	2023	12	115

Code

mean_y <- data.table::dcast(
  long_dt,
  group ~ year,
  value.var = "y",
  fun.aggregate = mean
)

mean_y

group	2022	2023
A	11	13.5
B	10	11.0

Code

mean_multi <- data.table::dcast(
  long_dt,
  group ~ year,
  value.var = c("y", "z"),
  fun.aggregate = mean
)

mean_multi

group	y_2022	y_2023	z_2022	z_2023
A	11	13.5	110	135
B	10	11.0	100	105

Compared with tidyr, data.table::dcast() makes aggregation part of the casting step and can cast multiple value columns in one pass. It is also a common choice for very large tables where performance matters.

Further reshaping extensions using data.table package: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html

Data structure semantics

Variables

Observations

Tidy data

Tidying verbs

Data tidying examples

pivot_longer example

Messy version

Tidy version

Tidying solution

pivot_wider example

Messy version

Tidy version

Tidying solution

Missingness: implicit vs explicit

separate example

Messy version

Tidy version

Tidying solution

Modern separate_* alternatives

Tidying solution using pivot_longer alone

unite example

Tidying solution

Advanced reshaping

data.table melt/dcast

Modern `separate_*` alternatives