Data wrangling in dplyr

Author

Michael Hallquist, PSYC 859

Published

March 12, 2026

The goal of this document is to provide a basic introduction to data wrangling using functions from the so-called ‘tidyverse’ approach. The tidyverse (https://www.tidyverse.org) is a set of data science packages in R that are intended to provide a consistent paradigm for working with data. This approach unifies a previously inchoate landscape of different functions and packages in R that could be daunting to new users.

Although I do not claim that the tidyverse approach is best according to all possible criteria, I believe that it is the best paradigm for working with data in R for social scientists, many of whom do not have a formal background in computer programming.

Here, I will draw primarily from the tidyr and dplyr packages in R.

For an excellent book-length treatment of the tidyverse approach, see R for Data Science (2nd edition) by Hadley Wickham, Mine Cetinkaya-Rundel, and Garrett Grolemund.

Before we start: beware namespace collisions!

One of the most irritating problems you may encounter in the tidyverse world (and more generally, in R) is when code that previously worked suddenly throws an inexplicable error.

For example:

> survey %>% group_by(R_exp) %>% 
summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))

Error in summarize(., m_age = mean(Psych_age_yrs), sd_age = sd(Psych_age_yrs)) : 
argument "by" is missing, with no default

By using fairly intuitive data wrangling verbs such as ‘summarize’ and ‘select’, dplyr (and sometimes tidyr) sometimes uses the same function names as other packages. For example, Hmisc has a summarize function that does not operate in the same way as summarize in dplyr. Also, the predecessor to dplyr was called plyr. Although largely outmoded, it has a few remaining functions that may be useful. But… many of these functions have the same names in dplyr but operate differently (the syntax is not the same!), which can be a common source of collisions when using dplyr.

This points to the problem of what are called ‘namespace collisions.’ That is, when R looks for a function (or any object) in the global environment, it searches through a ‘path’. You can see the nitty gritty using searchpaths(). But the TL;DR is that if you – or any function you call on – loads another package, that package may override a dplyr function and make your code crash!

Addressing namespace collisions

  1. Watch out for warnings about objects being ‘masked’ when packages are loaded.
  2. Explicitly specify the package where your desired function lives using the double colon operator. Example: dplyr::summarize.
  3. Try to load tidyverse packages using library(tidyverse). This handles collisions within the tidyverse!

For example, in this document, multilevel pulls in MASS, which masks dplyr::select. If we were to load dplyr first, then multilevel, we would see the following:

> library(dplyr)
> library(multilevel)
Loading required package: nlme

Attaching package: ‘nlme’

The following object is masked from ‘package:dplyr’:

    collapse

Loading required package: MASS

Attaching package: ‘MASS’

The following object is masked from ‘package:dplyr’:

    select

Read in English, this says, “The multilevel package needs to load nlme (a mixed-effects package). Nlme has a function called ‘collapse’ that will now take precedence over ‘collapse’ in dplyr if you run collapse() in your session. Multilevel also needs the MASS package, so I’m loading it. The MASS package has a function called ‘select’ that now takes precedence over dplyr’s select function, so if you call select() in your session, you will be getting the MASS package function, not the dplyr one.”

To get around these problems, the easiest route is to load dplyr (and other tidyverse packages) last in the chain, giving them precedence over other packages. As you see above in the p_load call, we load dplyr last in the chain so that its select function is preferred over MASS. This is important because functions of the same name across packages may have completely different purposes and syntax!

The slightly harder, but more robust, route is always to use explicit namespace qualifiers with the :: operator. If you use dplyr::select as the function call, you are explicitly telling R that you want dplyr’s select function, not some other package’s.

The two most common collisions are with select (MASS) and summarize (Hmisc). If you’re not sure which version of a function has taken precedence you can type the function name without parentheses. This will print the function source (which you can ignore). But take a look at the last line, which reveals where R ‘found’ the function in the search across loaded packages (based on their load order/precedence).

> select
function (obj) 
UseMethod("select")
<bytecode: 0x93b18ebd8>
<environment: namespace:MASS>

If you were expecting that last line to read <environment: namespace:dplyr> (i.e. dplyr’s version), then you have a namespace problem to resolve!

Data pipelines

The tidyverse paradigm encourages the use of so-called data pipelines when writing the syntax for a multi-step data transformation procedure. The pipe operator %>% is provided by the magrittr package, which is loaded by dplyr. Data pipeline syntax is intended to provide a readable syntax for the order in which data operations are performed. You can think of this as a recipe for data wrangling. For example:

1. Read data from the file: 'mydata.csv'
2. Rename SDATE to submission_date; rename ed to education
3. Only keep subject IDs above 10 (the first 9 were in-lab pilot subjects)
4. Drop the timestamp and survey_settings variables
5. Compute the log of income as a new variable called log_income
6. Arrange the data by ID, then submission_date (where each subject submitted many surveys)
7. Ensure that ID and submission_date are the left-most columns in the data

We would write this out as a dplyr pipeline using the pipe operator %>% to chain together data operations.

dataset <- read.csv("mydata.csv") %>%
  rename(submission_date=SDATE, education=ed) %>%
  filter(ID < 10) %>%
  select(-timestamp, -survey_settings) %>%
  mutate(log_income = log(income)) %>%
  arrange(ID, submission_date) %>%
  select(ID, submission_date, everything())

Use of ‘this’ reference in tidyverse

Sometimes it is useful to refer to the current dataset or variable explicitly in tidyverse data wrangling syntax.

dplyr/magrittr tends to hide this from us for convenience, but it’s there under the hood.

iris %>% filter(Sepal.Length > 7)

is the same as

iris %>% filter(., Sepal.Length > 7)

So, '.' refers to the current dataset or variable (depending on context) in dplyr operations. And if you don’t specify where the '.' falls in your syntax, it will always be passed as the first argument to the downstream function.

Special values to watch out for in data wrangling

  • NA: missing
    • na.rm=TRUE available in many functions
    • Also, see na.omit(), na.exclude(), na.fail(), na.pass()
  • NULL: null set
    • Often used when something is undefined
  • Inf: infinite
  • NaN: Not a number.
    • Result of an invalid computation, e.g., log(-1)
  • warnings(): If R mentions a warning in data wrangling, make sure you’ve handled it or know its origin. Don’t just ignore these!

dplyr fundamentals

The dplyr package is at the core of data wrangling in the tidyverse, providing a set of wrangling verbs that collectively form a coherent language for data transformations.

Overview: a first pass through an NHANES dataset

To learn dplyr, let’s start with a survey from the National Health and Nutrition Examination Survey (NHANES) dataset. These data are provided in the nhanesA package. We’ll start by looking at a couple of basic demographic variables.

NHANES variables used below (from DEMO_D and BMX_D):

  • SEQN: respondent ID
  • RIAGENDR: gender (coded numeric in raw NHANES)
  • RIDAGEYR: age in years
  • DMDHREDU: education level
  • BMXWT: weight (kg)
  • BMXARML: upper arm length (cm)
  • BMXTHICR: thigh circumference (cm)
Code
# demographics: note that this code relies on functions in the nhanesA package, which
#   is useful for working with NHANES data, but is not a focus of this workshop per se.

demo_d <- nhanes('DEMO_D')
demo_d_vars  <- nhanesTableVars('DEMOGRAPHICS', 'DEMO_D', namesonly=TRUE)

# translate numeric codes into factors by using the nhanes lookup/codebook functions
demo_d <- suppressWarnings(nhanesTranslate('DEMO_D', demo_d_vars, data=demo_d))
Translated columns: AIALANG DMDBORN DMDCITZN DMDEDUC2 DMDEDUC3 DMDHRBRN DMDHREDU DMDHRGND DMDHRMAR DMDHSEDU DMDMARTL DMDSCHOL DMDYRSUS DMQMILIT FIAINTRP FIALANG FIAPROXY INDFMINC INDHHINC MIAINTRP MIALANG MIAPROXY RIAGENDR RIDEXMON RIDEXPRG RIDRETH1 RIDSTATR SDDSRVYR SIAINTRP SIALANG SIAPROXY
Code
# dplyr pipeline
demo_d <- demo_d %>% 
  filter(!INDHHINC %in% c("Over $20,000", "Under $20,000", "Refused", "Don't know")) %>% 
  droplevels() %>% # drop unused factor levels from the data.frame
  mutate(
    # case_when() is a vectorized if/else ladder for many conditions
    income_num = case_when( # convert range-based factor labels to midpoint numbers
    INDHHINC == "$     0 to $ 4,999" ~ 2500,
    INDHHINC == "$ 5,000 to $ 9,999" ~ 7500,
    INDHHINC == "$10,000 to $14,999" ~ 12500,
    INDHHINC == "$15,000 to $19,999" ~ 17500,
    INDHHINC == "$20,000 to $24,999" ~ 22500,
    INDHHINC == "$25,000 to $34,999" ~ 30000,
    INDHHINC == "$35,000 to $44,999" ~ 40000,
    INDHHINC == "$45,000 to $54,999" ~ 50000,
    INDHHINC == "$55,000 to $64,999" ~ 60000,
    INDHHINC == "$65,000 to $74,999" ~ 70000,
    INDHHINC == "$75,000 and Over" ~ 80000
    )
  )
 
#load body (biometric) measures
bmx_d <- nhanes('BMX_D')
bmx_d_vars  <- nhanesTableVars('EXAM', 'BMX_D', namesonly=TRUE)
bmx_d <- suppressWarnings(nhanesTranslate('BMX_D', bmx_d_vars, data=bmx_d))
Translated columns: BMDSTATS BMIARMC BMIARML BMICALF BMIHT BMILEG BMIRECUM BMISUB BMITHICR BMITRI BMIWAIST BMIWT
Code
# merge the education demographics variable with the biometric data, joining on the SEQN column (basically an ID)
bmx_d <- demo_d %>% 
  dplyr::select(SEQN, DMDHREDU) %>% 
  inner_join(bmx_d, by="SEQN")

group_by + summarize

Let’s summarize the mean household income (income_num), converted from categories in INDHHINC) by highest level of education completed (DMDHREDU)

Code
demo_d %>% 
  group_by(DMDHREDU) %>% # divide dataset into separate compartments by education level
  dplyr::summarize(
    n=n(), # number of observations
    m_income=mean(income_num, na.rm=T), 
    sd_income=sd(income_num, na.rm=T)
  ) %>% 
  kable_table()
DMDHREDU n m_income sd_income
Less Than 9th Grade 1173 28024 18489
9-11th Grade (Includes 12th grade with no diploma) 1626 31318 21584
High School Grad/GED or equivalent 2317 39400 23422
Some College or AA degree 2664 48137 25326
College Graduate or above 1768 63785 21816
Refused 6 25000 12624
Don't know 26 20900 15760
NA 276 44623 24849

Note that summarize removes a single level of grouping in the group_by process. Here, we only have one grouping variable, DMDHREDU, so the output of summarize will be ‘ungrouped.’

Grouped summaries of several variables

What if I want to have means and SDs for several continuous variables grouped by highest education? Let’s look specifically at the weight (BMXWT), upper arm length (BMXARML), and thigh circumference (BMXTHICR) measurements. The combination of summarize and across provide functionality to specify several variables using the .cols argument of across and potentially several summary functions by passing them in a named list.

Code
bmx_d %>% 
  group_by(DMDHREDU) %>% 
  dplyr::summarize(
    across(
      c(BMXWT, BMXARML, BMXTHICR), 
      list(m=~mean(.x, na.rm=T), sd=~sd(.x, na.rm=T))
    )
  ) %>% 
  kable_table()
DMDHREDU BMXWT_m BMXWT_sd BMXARML_m BMXARML_sd BMXTHICR_m BMXTHICR_sd
Less Than 9th Grade 58 30 32 7.8 50 7.48
9-11th Grade (Includes 12th grade with no diploma) 56 33 32 8.2 51 8.40
High School Grad/GED or equivalent 61 33 33 7.9 52 8.10
Some College or AA degree 62 32 33 7.6 52 8.65
College Graduate or above 61 31 33 7.8 52 7.99
Refused 70 25 38 2.5 50 0.57
Don't know 50 36 28 11.5 53 7.74
NA 56 29 32 7.6 51 7.73

Let’s slow this down:

group_by verb

survey %>% group_by(DMDHREDU)

This tells dplyr to divide the bmx_d (NHANES biometric measurements) data into a set of smaller data.frame objects, one per level of DMDHREDU. Internally, this looks something like the output below. After this division of the dataset into chunks, summarize will work on each chunk individually.

$`Less Than 9th Grade`
   SEQN            DMDHREDU                    BMDSTATS BMXWT
1 31137 Less Than 9th Grade          Other partial exam    80
2 31157 Less Than 9th Grade Complete data for age group    42
3 31169 Less Than 9th Grade Complete data for age group    22
4 31175 Less Than 9th Grade Complete data for age group    86

$`9-11th Grade (Includes 12th grade with no diploma)`
   SEQN                                           DMDHREDU
1 31128 9-11th Grade (Includes 12th grade with no diploma)
2 31133 9-11th Grade (Includes 12th grade with no diploma)
3 31145 9-11th Grade (Includes 12th grade with no diploma)
4 31148 9-11th Grade (Includes 12th grade with no diploma)
                     BMDSTATS BMXWT
1 Complete data for age group    40
2 Complete data for age group    45
3 Complete data for age group    40
4 Complete data for age group    52

$`High School Grad/GED or equivalent`
   SEQN                           DMDHREDU                    BMDSTATS BMXWT
1 31127 High School Grad/GED or equivalent Complete data for age group    10
2 31139 High School Grad/GED or equivalent          Other partial exam    74
3 31140 High School Grad/GED or equivalent Complete data for age group    42
4 31143 High School Grad/GED or equivalent Complete data for age group    76

$`Some College or AA degree`
   SEQN                  DMDHREDU                    BMDSTATS BMXWT
1 31129 Some College or AA degree Complete data for age group    75
2 31130 Some College or AA degree  No body measures exam data    NA
3 31138 Some College or AA degree Complete data for age group    14
4 31142 Some College or AA degree          Other partial exam    80

$`College Graduate or above`
   SEQN                  DMDHREDU                    BMDSTATS BMXWT
1 31131 College Graduate or above          Other partial exam    75
2 31132 College Graduate or above Complete data for age group    70
3 31135 College Graduate or above Complete data for age group    10
4 31141 College Graduate or above Complete data for age group    60

$Refused
   SEQN DMDHREDU                             BMDSTATS BMXWT
1 33385  Refused          Complete data for age group    78
2 35296  Refused Partial:  Height and weight obtained    38
3 37761  Refused          Complete data for age group    67
4 40762  Refused Partial:  Height and weight obtained    99

$`Don't know`
   SEQN   DMDHREDU                    BMDSTATS BMXWT
1 31546 Don't know          Other partial exam    78
2 32106 Don't know Complete data for age group    12
3 32626 Don't know Complete data for age group    64
4 32648 Don't know Complete data for age group    85

summarize + across

The summarize function transforms a dataset that has many rows to a dataset that has a single row per grouping unit. If you do not use group_by, summarize will yield an overall summary statistic in the entire dataset. For example, to get the mean and SD of household income in NHANES, irrespective of education level or other categorical moderators, we could just use a simple summarize:

Code
demo_d %>%
  dplyr::summarize(
    m_income=mean(income_num, na.rm=T), 
    sd_income=sd(income_num, na.rm=T)
  ) %>% 
  kable_table()
m_income sd_income
43590 25726

But because we used group_by(DMDHREDU) above, we got unique summaries of the variables at each level of education.

The across function accepts two primary arguments. First, we specify a set of variables (the .cols argument) that we wish to summarize in the same way (i.e., compute the same summary statistics). Second, we specify which statistics we wish to compute (the .fns argument). In our case, the syntax was:

dplyr::summarize(
    across(c(BMXWT, BMXARML, BMXTHICR), 
           list(m=~mean(.x, na.rm=T), sd=~sd(.x, na.rm=T)))
)

The c() function specifies a vector of unquoted variable names in the dataset we wish to summarize, separated by commas. Note that any tidyselect operator for selecting columns will work. This includes starts_with, ends_with, matches, and others. For details, see ?dplyr_tidy_select.

The list() here asks dplyr to run each function in the list against each variables in the .cols specification. Here, this means that dplyr will compute the mean and SD of each variable in the .cols argument at each level of education completed (the group_by basis). The names of the list elements (left side) — here, m and sd — become the suffixes added for each variable. The value of the element (right side) — here, mean and sd — are the functions that should be used to compute a summary statistic (they should return one number per grouped variable).

Note that the column names that result from an across operation can be modified using the .names argument. The default is to suffix the variable name with the names of the functions list, preceded by an underscore (e.g., income_m).

Passing arguments to summary functions

Notice how we passed na.rm=TRUE to the mean function within the list. This tells the mean to ignore missing (NA) values when computing the mean (i.e., mean of the non-missing numbers). In general, the dplyr syntax using what they call “lambdas” (starting with ~) is the clearest way to control the arguments passed to each function. Here is a simple definition of a lambda in R:

~ mean(.x, na.rm=TRUE)

The .x refers to the current variable being used within a dplyr data wrangling operation. This is in contrast to ., which generally refers to the current dataset.

If you don’t need to pass arguments to the functions in an across() operation, you can just state the function name:

dplyr::summarize(
    across(c(BMXWT, BMXARML, BMXTHICR), 
           list(m=mean, sd=sd))
)

Making a summarize pipeline even more beautiful

We can also make the output more beautiful using tidying techniques we’ve already seen in the tidyr tutorial. Remember that R is all about programming for data science. In particular, notice that we have some columns that are means and others that are SDs.

We can just extend our data pipeline a bit. The extract function from tidyr here is like separate, but with a bit more oomph using regular expressions. This is a more intermediate topic, but there is a useful tutorial here: http://www.regular-expressions.info/tutorial.html.

Code
bmx_d %>% 
  group_by(DMDHREDU) %>% 
  dplyr::summarize(
    across(
      c(BMXWT, BMXARML, BMXTHICR), 
      list(m=~mean(.x, na.rm=T), sd=~sd(.x, na.rm=T))
    )
  ) %>%
  
   # combine m and sd statistics (notice how you can add comments inline within a pipeline?)
  pivot_longer(cols=-DMDHREDU, names_to = "Measure", values_to = "value") %>%
  
  # divide income_m into income and m
  #extract(col=Measure, into=c("bio_measure", "statistic"), regex=("(.*)_(.*)$")) %>% 
  separate(col=Measure, into=c("bio_measure", "statistic"), sep="_") %>% 
  pivot_wider(names_from=statistic, values_from = value) %>% 
  arrange (bio_measure, DMDHREDU) %>%
  kable_table()
DMDHREDU bio_measure m sd
Less Than 9th Grade BMXARML 32 7.77
9-11th Grade (Includes 12th grade with no diploma) BMXARML 32 8.22
High School Grad/GED or equivalent BMXARML 33 7.86
Some College or AA degree BMXARML 33 7.61
College Graduate or above BMXARML 33 7.81
Refused BMXARML 38 2.47
Don't know BMXARML 28 11.49
NA BMXARML 32 7.62
Less Than 9th Grade BMXTHICR 50 7.48
9-11th Grade (Includes 12th grade with no diploma) BMXTHICR 51 8.40
High School Grad/GED or equivalent BMXTHICR 52 8.10
Some College or AA degree BMXTHICR 52 8.65
College Graduate or above BMXTHICR 52 7.99
Refused BMXTHICR 50 0.57
Don't know BMXTHICR 53 7.74
NA BMXTHICR 51 7.73
Less Than 9th Grade BMXWT 58 30.15
9-11th Grade (Includes 12th grade with no diploma) BMXWT 56 33.44
High School Grad/GED or equivalent BMXWT 61 32.70
Some College or AA degree BMXWT 62 32.48
College Graduate or above BMXWT 61 31.30
Refused BMXWT 70 25.08
Don't know BMXWT 50 36.00
NA BMXWT 56 29.12

arrange: order observations

Toward the end of the pipeline above, we see:

arrange (bio_measure, DMDHREDU) %>%

The arrange verb in dplyr requests that observations be sorted according to one or more variables. Here, we ask for the dataset to be sorted by bio_measure (biometric measure, such as weight) first, then by education level within that measure. The arrange verb sorts observations in ascending order (low to high) by default, but data can be sorted in descending order using the desc() function:

arrange(bio_measure, desc(DMDHREDU)) %>%

This would sort by highest to lowest education level within each measure (bio_measure), where the bio_measures are still in ascending (alphabetical) order

Examining a more complex multilevel dataset using dplyr

Let’s examine the univbct data, which contains longitudinal observations of job satisfaction, commitment, and readiness to deploy. From the documentation ?univbct:

This data set contains the complete data set used in Bliese and Ployhart (2002). The data is longitudinal data converted to univariate (i.e., stacked) form. Data were collected at three time points. A data frame with 22 columns and 1485 observations from 495 individuals.

We have 1485 observations of military personnel nested within companies, which are nested within batallions: https://en.wikipedia.org/wiki/Battalion.

Code
data(univbct, package="multilevel")
str(univbct)
'data.frame':   1485 obs. of  22 variables:
 $ BTN    : num  1022 1022 1022 1004 1004 ...
 $ COMPANY: Factor w/ 8 levels "A","B","C","D",..: 6 6 6 4 4 4 2 2 2 2 ...
 $ MARITAL: num  1 1 1 4 4 4 2 2 2 2 ...
 $ GENDER : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HOWLONG: num  2 2 2 0 0 0 0 0 0 1 ...
 $ RANK   : num  12 12 12 13 13 13 15 15 15 14 ...
 $ EDUCATE: num  2 2 2 2 2 2 2 2 2 2 ...
 $ AGE    : num  20 20 20 24 24 24 24 24 24 23 ...
 $ JOBSAT1: num  1.67 1.67 1.67 3.67 3.67 ...
 $ COMMIT1: num  1.67 1.67 1.67 1.67 1.67 ...
 $ READY1 : num  2.75 2.75 2.75 3 3 3 3.75 3.75 3.75 2.5 ...
 $ JOBSAT2: num  1 1 1 4 4 ...
 $ COMMIT2: num  1.67 1.67 1.67 1.33 1.33 ...
 $ READY2 : num  1 1 1 2 2 2 3.75 3.75 3.75 3.25 ...
 $ JOBSAT3: num  3 3 3 4 4 4 4 4 4 3 ...
 $ COMMIT3: num  3 3 3 1.33 1.33 ...
 $ READY3 : num  3 3 3 1.75 1.75 1.75 1.75 1.75 1.75 3 ...
 $ TIME   : num  0 1 2 0 1 2 0 1 2 0 ...
 $ JSAT   : num  1.67 1 3 3.67 4 ...
 $ COMMIT : num  1.67 1.67 3 1.67 1.33 ...
 $ READY  : num  2.75 1 3 3 2 1.75 3.75 3.75 1.75 2.5 ...
 $ SUBNUM : num  1 1 1 2 2 2 3 3 3 4 ...

Let’s enact the core ‘verbs’ of dplyr to understand and improve the structure of these data.

filter: obtaining observations (rows) based on some criteria

Objective: Retain only men in company A

Code
COMPANY="A"
COMPANY <- "A"

company_A_men <- filter(univbct, COMPANY=="A" & GENDER==1)
#print 10 observations at random to check the accuracy of the filter
#p=11 just shows the first 11 columns to keep it on one page for formatting
company_A_men %>% sample_n(10) %>% kable_table(p=11)
BTN COMPANY MARITAL GENDER HOWLONG RANK EDUCATE AGE JOBSAT1 COMMIT1 READY1
232 1022 A 2 1 2 15 2 29 3.7 3.3 3.2
1373 404 A 2 1 1 16 4 36 4.0 4.0 3.2
1395 3066 A 2 1 2 15 2 26 4.0 4.0 3.8
1235 3066 A 2 1 2 16 2 34 3.0 3.7 3.0
1420 3066 A 2 1 3 13 3 21 3.0 3.7 2.8
1208 4 A 2 1 3 15 2 28 2.3 2.7 1.8
233 1022 A 2 1 2 15 2 29 3.7 3.3 3.2
1159 299 A 2 1 3 14 2 24 2.7 3.0 2.5
445 1022 A 1 1 1 14 2 23 3.7 3.3 3.5
616 4042 A 1 1 2 13 2 19 1.3 3.3 2.5

Objective: Count how many people are in companies A and B

Code
filter(univbct, COMPANY %in% c("A","B")) %>% nrow()
[1] 750

Objective: What about counts by company and battalion?

Code
univbct %>% 
  group_by(BTN, COMPANY) %>% 
  tally() %>%
  kable_table(n=12)
BTN COMPANY n
4 A 66
4 B 15
4 C 12
4 D 30
4 HHC 18
104 A 12
104 HHC 3
124 A 42
144 A 30
299 A 39
299 B 30
299 C 27
Code
# N.B. The same result could be obtained with count(BTN, COMPANY) alone.
#  This combines the group_by and tally functions

select: choose variables (columns) based on some criteria

Let’s start by keeping only the three core dependent variables over time: jobsat, commit, ready. Keep SUBNUM as well for unique identification.

Code
dvs_only <- univbct %>% 
  dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, 
                COMMIT1, COMMIT2, COMMIT3, 
                READY1, READY2, READY3)

If you have many variables of a similar name, you might try starts_with(). Note in this case that it brings in “READY”, too. Note that you can mix different selection mechanisms within select. Look at the cheatsheet.

Code
dvs_only <- univbct %>% 
  dplyr::select(SUBNUM, starts_with("JOBSAT"), starts_with("COMMIT"), starts_with("READY"))

Other selection mechanisms:

  • contains: variable name contains a literal string
  • starts_with: variable names start with a string
  • ends_with: variable names end with a string
  • num_range: variables that have a common prefix (e.g., ‘reasoning’) and a numeric range (e.g., 1-20)
  • matches: variable name matches a regular expression
  • one_of: variable is one of the elements in a character vector. Example: select(one_of(c(“A”, “B”)))

See ?select_helpers for more details.

select + filter: zooming in on specific observations and variables

Note that select and filter can be combined to subset both observations and variables of interest.

For example, look at readiness to deploy in battalion 299 only:

Code
univbct %>% 
  filter(BTN==299) %>% 
  dplyr::select(SUBNUM, READY1, READY2, READY3) %>% 
  kable_table(n=6)
SUBNUM READY1 READY2 READY3
10 4 2.5 3.2 3.0
11 4 2.5 3.2 3.0
12 4 2.5 3.2 3.0
19 7 2.0 1.8 1.2
20 7 2.0 1.8 1.2
21 7 2.0 1.8 1.2

select is also useful for dropping variables that are not of interest using a kind of subtraction syntax.

Code
nojobsat <- univbct %>% 
  dplyr::select(-starts_with("JOBSAT"))
names(nojobsat)
 [1] "BTN"     "COMPANY" "MARITAL" "GENDER"  "HOWLONG" "RANK"    "EDUCATE"
 [8] "AGE"     "COMMIT1" "READY1"  "COMMIT2" "READY2"  "COMMIT3" "READY3" 
[15] "TIME"    "JSAT"    "COMMIT"  "READY"   "SUBNUM" 

mutate: add one or more variables that are a function of other variables

(Row-wise) mean of commit scores over waves. Note how you can used select() within a mutate to run a function on a subset of the data.

Code
univbct <- univbct %>% 
  mutate(commitmean=rowMeans(dplyr::select(., COMMIT1, COMMIT2, COMMIT3)))

Mutate can manipulate several variables in one call. Here, mean center any variable that starts with COMMIT and add the suffix _cm for clarity. Also compute the percentile rank for each of these columns, with _pct as suffix. Note the use of the starts_with function here within the across(). This operates identically to select, but in the context of a summary or mutation operation on specific variables. See ?select_helpers for details.

Code
meancent <- function(x) { x - mean(x, na.rm=TRUE) } #simple worker function to mean center a variable

univbct <- univbct %>% 
  mutate(across(starts_with("COMMIT", ignore.case = FALSE), list(cm=meancent, pct=percent_rank)))

univbct %>%
  dplyr::select(starts_with("COMMIT", ignore.case = FALSE)) %>%
  kable_table(n=8) %>% kable_styling(font_size = 12)
COMMIT1 COMMIT2 COMMIT3 COMMIT COMMIT1_cm COMMIT1_pct COMMIT2_cm COMMIT2_pct COMMIT3_cm COMMIT3_pct COMMIT_cm COMMIT_pct
1.7 1.7 3.0 1.7 -1.95 0.01 -1.80 0.04 -0.54 0.12 -1.87 0.02
1.7 1.7 3.0 1.7 -1.95 0.01 -1.80 0.04 -0.54 0.12 -1.87 0.02
1.7 1.7 3.0 3.0 -1.95 0.01 -1.80 0.04 -0.54 0.12 -0.54 0.15
1.7 1.3 1.3 1.7 -1.95 0.01 -2.13 0.03 -2.20 0.01 -1.87 0.02
1.7 1.3 1.3 1.3 -1.95 0.01 -2.13 0.03 -2.20 0.01 -2.21 0.01
1.7 1.3 1.3 1.3 -1.95 0.01 -2.13 0.03 -2.20 0.01 -2.21 0.01
3.3 3.3 3.7 3.3 -0.28 0.27 -0.13 0.33 0.13 0.45 -0.21 0.30
3.3 3.3 3.7 3.3 -0.28 0.27 -0.13 0.33 0.13 0.45 -0.21 0.30

arrange: reorder observations in specific order

Order data by ascending battalion, company, then subnum

Code
univbct <- univbct %>% 
  arrange(BTN, COMPANY, SUBNUM)

Descending sort: descending battalion, ascending company, ascending subnum

Code
univbct <- univbct %>% 
  arrange(desc(BTN), COMPANY, SUBNUM)

A more realistic example: preparation for multilevel analysis

In MLM, one strategy for disentangling within- versus between-person effects is to include both within-person-centered variables and person means in the model (Curran & Bauer, 2011).

We can achieve this easily for our three DVs here using a single pipeline that combines tidying and mutation. Using -1 as the sep argument to separate splits the string at the second-to-last position (i.e., starting at the right).

For reshaping to work smoothly, we need a unique identifier for each row. Also, univbct is stored in a dangerously untidy format in which variables with suffix 1-3 indicate a ‘wide format’, but the data is also in long format under variables such as ‘JSAT’ and ‘COMMIT.’ In other words, there is a peculiar redundancy in the data that is altogether confusing.

Take a look:

Code
univbct %>%
  dplyr::select(SUBNUM, starts_with("JOBSAT"), JSAT) %>% 
  kable_table(n=12)
SUBNUM JOBSAT1 JOBSAT2 JOBSAT3 JSAT
319 103 2.0 2.3 3.3 2.0
320 103 2.0 2.3 3.3 2.3
321 103 2.0 2.3 3.3 3.3
397 129 3.7 4.3 4.7 3.7
398 129 3.7 4.3 4.7 4.3
399 129 3.7 4.3 4.7 4.7
523 171 3.7 4.0 NA 3.7
524 171 3.7 4.0 NA 4.0
525 171 3.7 4.0 NA NA
616 202 1.3 2.0 4.3 1.3
617 202 1.3 2.0 4.3 2.0
618 202 1.3 2.0 4.3 4.3

We first need to eliminate this insanity. Group by subject number and retain only the first row (i.e., keep the wide version).

Code
univbct <- univbct %>% 
  group_by(SUBNUM) %>% # split into separate groups for each subject
  filter(row_number() == 1) %>% # only retain the first row of each subject
  dplyr::select(-JSAT, -COMMIT, -READY) %>% # drop redundant columns
  ungroup() # remove grouping from data structure (we are done with group-based wrangling)

First, let’s get the data into a conventional format (long) for MLM (e.g., using lmer)

Code
forMLM <- univbct %>% 
  dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, 
                COMMIT1, COMMIT2, COMMIT3, 
                READY1, READY2, READY3) %>% 
  
  # pivot everything but SUBNUM
  pivot_longer(names_to = "key", values_to = "value", cols=-SUBNUM) %>%
  
  # -1 splits at the last character of the variable name
  separate(col="key", into=c("variable", "occasion"), -1, convert=TRUE) %>%
  pivot_wider(names_from = variable, values_from = value) #%>% 
  #mutate(occasion=as.integer(occasion))

Now, let’s perform the centering described above. You could do this in one pipeline – I just separated things here for conceptual clarity.

Code
forMLM <- forMLM %>% group_by(SUBNUM) %>% 
  mutate(across(c(COMMIT, JOBSAT, READY), list(wic=meancent, pm=mean))) %>%
  ungroup()

forMLM %>% kable_table(n=10) %>% kable_styling(font_size = 14)
SUBNUM occasion JOBSAT COMMIT READY COMMIT_wic COMMIT_pm JOBSAT_wic JOBSAT_pm READY_wic READY_pm
103 1 2.0 3.7 4.0 0.00 3.7 -0.56 2.6 1.25 2.8
103 2 2.3 3.7 2.0 0.00 3.7 -0.22 2.6 -0.75 2.8
103 3 3.3 3.7 2.2 0.00 3.7 0.78 2.6 -0.50 2.8
129 1 3.7 5.0 2.5 0.44 4.6 -0.56 4.2 -0.33 2.8
129 2 4.3 4.3 2.8 -0.22 4.6 0.11 4.2 -0.08 2.8
129 3 4.7 4.3 3.2 -0.22 4.6 0.44 4.2 0.42 2.8
171 1 3.7 4.0 3.2 -0.17 NA -0.17 NA -0.12 NA
171 2 4.0 4.3 3.5 0.17 NA 0.17 NA 0.12 NA
171 3 NA NA NA NA NA NA NA NA NA
202 1 1.3 3.3 2.5 0.44 2.9 -1.22 2.6 -0.75 3.2

Reference: Rows, columns, and groups in dplyr

Most dplyr workflows can be organized around three kinds of operations: selecting rows, selecting columns, and operating within groups.

Code
demo_q <- nhanes('DEMO_D')
bmx_q <- nhanes('BMX_D')

Rows: keep or reorder observations

Code
demo_q %>%
  filter(RIDAGEYR > 25, RIAGENDR == "Male") %>%  # filter rows by logical and: over 25 and male 
  arrange(desc(RIDAGEYR)) %>%                # order rows by descending age
  distinct(DMDHREDU, .keep_all = TRUE) %>%   # keep first row per education level for inspection
  select(SEQN, RIDAGEYR, RIAGENDR, DMDHREDU) %>% # subset down to a few variables of interest
  
  # Watch out for attempts to refer to variables you have 'selected out' earlier in the pipeline.
  # Here, DMDHRBRN isn't in the select statement above, so we can't use it in a downstream mutate call
  # mutate(
  #   in_us = if_else(DMDHRBRN == "Born in 50 US States or Washington, DC", TRUE, FALSE)
  # ) %>%
  kable_table()
SEQN RIDAGEYR RIAGENDR DMDHREDU
31447 85 Male High School Grad/GED or equivalent
31558 85 Male 9-11th Grade (Includes 12th grade with no diploma)
31610 85 Male Some College or AA degree
32444 85 Male Less Than 9th Grade
32455 85 Male College Graduate or above
35818 85 Male NA
39906 78 Male Refused
31634 67 Male Don't know

Row filters can also use if_any() and if_all() across multiple columns:

Here, using if_any, we find observations where weight (BMXWT in kg), arm length (BMXARML in cm), or thigh circumference (BMXTHICR in cm) is greater than 60 (setting aside the different units of measurement).

Code
bmx_q %>%
  filter(if_any(c(BMXWT, BMXARML, BMXTHICR), ~ .x > 60)) %>%  # any column above 60
  slice_sample(n = 5) %>% # get a random sample of 5 rows
  kable_table()
SEQN BMDSTATS BMXWT BMIWT BMXRECUM BMIRECUM BMXHEAD BMIHEAD BMXHT BMIHT BMXBMI BMXLEG BMILEG BMXCALF BMICALF BMXARML BMIARML BMXARMC BMIARMC BMXWAIST BMIWAIST BMXTHICR BMITHICR BMXTRI BMITRI BMXSUB BMISUB
33990 Partial: Height and weight obtained 79 Clothing NA NA NA NA 165 NA 29 NA Could not obtain NA Could not obtain NA Could not obtain NA Could not obtain NA Could not obtain NA Could not obtain NA Could not obtain NA Could not obtain
35208 Complete data for age group 122 NA NA NA NA NA 191 NA 34 37 NA 46 NA 43 NA 36 NA 132 NA 54 NA 15.0 NA 31 NA
34966 Complete data for age group 72 NA NA NA NA NA 165 NA 27 36 NA 38 NA 36 NA 34 NA 92 NA 51 NA 8.4 NA 21 NA
35434 Complete data for age group 98 NA NA NA NA NA 177 NA 31 44 NA 39 NA 41 NA 30 NA 118 NA 55 NA 21.0 NA 23 NA
31945 Complete data for age group 89 NA NA NA NA NA 175 NA 29 36 NA 39 NA 39 NA 37 NA 117 NA 52 NA 19.2 NA 23 NA

By comparison, here are rows where all three of these are greater than 30, using if_all:

Code
bmx_q %>%
  filter(if_all(c(BMXWT, BMXARML, BMXTHICR), ~ .x > 30)) %>%  # all columns above 30
  slice_sample(n = 5) %>%
  kable_table()
SEQN BMDSTATS BMXWT BMIWT BMXRECUM BMIRECUM BMXHEAD BMIHEAD BMXHT BMIHT BMXBMI BMXLEG BMILEG BMXCALF BMICALF BMXARML BMIARML BMXARMC BMIARMC BMXWAIST BMIWAIST BMXTHICR BMITHICR BMXTRI BMITRI BMXSUB BMISUB
39400 Complete data for age group 73 NA NA NA NA NA 166 NA 27 37 NA 40 NA 39 NA 30 NA 96 NA 50 NA 20 NA 19.4 NA
40159 Other partial exam 74 NA NA NA NA NA 162 NA 28 36 NA 40 NA 34 NA 33 NA 94 NA 54 NA 33 NA NA Could not obtain
38581 Complete data for age group 36 NA NA NA NA NA 151 NA 16 33 NA 30 NA 31 NA 21 NA 60 NA 38 NA 14 NA 7.2 NA
31668 Complete data for age group 49 NA NA NA NA NA 153 Not straight 21 37 NA 32 NA 36 NA 23 NA 78 NA 43 NA 10 NA 7.6 NA
40802 Complete data for age group 93 NA NA NA NA NA 162 NA 36 37 NA 44 NA 38 NA 38 NA 115 NA 59 NA 34 NA 35.8 NA

Note: filter() drops rows where the condition is NA, so include is.na() checks if you want to keep missing values.

Code
bmx_q %>%
  filter(is.na(BMXWT) | BMXWT > 60) %>%  # keep missing weights or weights > 60
  kable_table(n = 6)
SEQN BMDSTATS BMXWT BMIWT BMXRECUM BMIRECUM BMXHEAD BMIHEAD BMXHT BMIHT BMXBMI BMXLEG BMILEG BMXCALF BMICALF BMXARML BMIARML BMXARMC BMIARMC BMXWAIST BMIWAIST BMXTHICR BMITHICR BMXTRI BMITRI BMXSUB BMISUB
31129 Complete data for age group 75 NA NA NA NA NA 168 NA 27 43 NA 41 NA 36 NA 33 NA 98 NA 56 NA 19 NA 18 NA
31130 No body measures exam data NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
31131 Other partial exam 75 NA NA NA NA NA 156 NA 31 38 NA 37 NA 35 NA 36 NA 96 NA 54 NA NA Exceeds capacity NA Could not obtain
31132 Complete data for age group 70 NA NA NA NA NA 168 NA 25 40 NA 36 NA 38 NA 31 NA 96 NA 48 NA 10 NA 22 NA
31134 Complete data for age group 102 NA NA NA NA NA 182 NA 31 42 NA 43 NA 43 NA 33 NA 117 NA 50 NA 13 NA 16 NA
31137 Other partial exam 80 NA NA NA NA NA 170 NA 28 45 NA 41 NA 38 NA 32 NA 89 NA 60 NA 30 NA NA Could not obtain

You can also take slices directly:

Code
demo_q %>%
  slice_max(RIDAGEYR, n = 3, with_ties = FALSE) %>%  # top 3 rows by age
  kable_table()
SEQN SDDSRVYR RIDSTATR RIDEXMON RIAGENDR RIDAGEYR RIDAGEMN RIDAGEEX RIDRETH1 DMQMILIT DMDBORN DMDCITZN DMDYRSUS DMDEDUC3 DMDEDUC2 DMDSCHOL DMDMARTL DMDHHSIZ DMDFMSIZ INDHHINC INDFMINC INDFMPIR RIDEXPRG DMDHRGND DMDHRAGE DMDHRBRN DMDHREDU DMDHRMAR DMDHSEDU SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANG WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA
31130 NHANES 2005-2006 Public Release Both Interviewed and MEC examined May 1 through October 31 Female 85 NA NA Non-Hispanic White No Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA Some College or AA degree NA Widowed 1 1 $15,000 to $19,999 $15,000 to $19,999 1.99 NA Female 85 Born in 50 US States or Washington, DC Some College or AA degree Widowed NA English No No English No No NA NA NA NA 29961 34031 2 46
31149 NHANES 2005-2006 Public Release Both Interviewed and MEC examined May 1 through October 31 Female 85 NA NA Non-Hispanic White No Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA 9-11th Grade (Includes 12th grade with no diploma) NA Widowed 1 1 $ 0 to $ 4,999 $ 0 to $ 4,999 0.05 NA Female 85 Born in 50 US States or Washington, DC 9-11th Grade (Includes 12th grade with no diploma) Widowed NA English No No English No No English No No English 23813 25998 2 52
31297 NHANES 2005-2006 Public Release Both Interviewed and MEC examined May 1 through October 31 Female 85 NA NA Non-Hispanic White No Born Elsewhere Not a citizen of the US 50 years or more NA High School Grad/GED or Equivalent NA Widowed 2 2 $75,000 and Over $75,000 and Over 5.00 NA Male 47 Born in 50 US States or Washington, DC High School Grad/GED or equivalent Divorced NA English No No English No No English No No English 33863 36880 1 46

Or take the first few rows in a group with slice_head():

Code
demo_q %>%
  group_by(RIAGENDR) %>%
  slice_head(n = 2) %>%
  kable_table()
SEQN SDDSRVYR RIDSTATR RIDEXMON RIAGENDR RIDAGEYR RIDAGEMN RIDAGEEX RIDRETH1 DMQMILIT DMDBORN DMDCITZN DMDYRSUS DMDEDUC3 DMDEDUC2 DMDSCHOL DMDMARTL DMDHHSIZ DMDFMSIZ INDHHINC INDFMINC INDFMPIR RIDEXPRG DMDHRGND DMDHRAGE DMDHRBRN DMDHREDU DMDHRMAR DMDHSEDU SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANG WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA
31127 NHANES 2005-2006 Public Release Both Interviewed and MEC examined May 1 through October 31 Male 0 11 12 Non-Hispanic White NA Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA NA NA NA 4 4 $15,000 to $19,999 $15,000 to $19,999 0.75 NA Female 21 Born in 50 US States or Washington, DC High School Grad/GED or equivalent Married 9-11th Grade (Includes 12th grade with no diploma) English Yes No English No No NA NA NA NA 6435 6571 2 44
31129 NHANES 2005-2006 Public Release Both Interviewed and MEC examined May 1 through October 31 Male 15 189 190 Non-Hispanic Black NA Born in 50 US States or Washington, DC Citizen by birth or naturalization NA 10th Grade NA In school Never married 6 6 $65,000 to $74,999 $65,000 to $74,999 2.71 NA Male 41 Born in 50 US States or Washington, DC Some College or AA degree Married Some College or AA degree English Yes No English No No English No No English 5317 5587 1 51
31128 NHANES 2005-2006 Public Release Both Interviewed and MEC examined November 1 through April 30 Female 11 132 132 Non-Hispanic Black NA Born in 50 US States or Washington, DC Citizen by birth or naturalization NA 4th Grade NA In school NA 7 6 $45,000 to $54,999 $20,000 to $24,999 0.77 SP not pregnant at exam Male 47 Born in 50 US States or Washington, DC 9-11th Grade (Includes 12th grade with no diploma) NA NA English Yes No English No No English No No English 9082 8987 1 52
31130 NHANES 2005-2006 Public Release Both Interviewed and MEC examined May 1 through October 31 Female 85 NA NA Non-Hispanic White No Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA Some College or AA degree NA Widowed 1 1 $15,000 to $19,999 $15,000 to $19,999 1.99 NA Female 85 Born in 50 US States or Washington, DC Some College or AA degree Widowed NA English No No English No No NA NA NA NA 29961 34031 2 46

Columns: choose or reshape variables

Code
demo_q %>%
  select(SEQN, RIAGENDR, RIDAGEYR) %>%    # keep columns
  rename(gender = RIAGENDR) %>%           # rename columns
  relocate(RIDAGEYR, .before = gender) %>%    # move columns
  kable_table(n = 6)
SEQN RIDAGEYR gender
31127 0 Male
31128 11 Female
31129 15 Male
31130 85 Female
31131 44 Female
31132 70 Male

Tidyselect patterns: select columns by name or type

Code
demo_q %>%
  select(SEQN, starts_with("RID"), ends_with("YR")) %>%
  kable_table(n = 6)
SEQN RIDSTATR RIDEXMON RIDAGEYR RIDAGEMN RIDAGEEX RIDRETH1 RIDEXPRG SDDSRVYR WTINT2YR WTMEC2YR
31127 Both Interviewed and MEC examined May 1 through October 31 0 11 12 Non-Hispanic White NA NHANES 2005-2006 Public Release 6435 6571
31128 Both Interviewed and MEC examined November 1 through April 30 11 132 132 Non-Hispanic Black SP not pregnant at exam NHANES 2005-2006 Public Release 9082 8987
31129 Both Interviewed and MEC examined May 1 through October 31 15 189 190 Non-Hispanic Black NA NHANES 2005-2006 Public Release 5317 5587
31130 Both Interviewed and MEC examined May 1 through October 31 85 NA NA Non-Hispanic White NA NHANES 2005-2006 Public Release 29961 34031
31131 Both Interviewed and MEC examined May 1 through October 31 44 535 536 Non-Hispanic Black SP not pregnant at exam NHANES 2005-2006 Public Release 26458 26771
31132 Both Interviewed and MEC examined May 1 through October 31 70 842 843 Non-Hispanic White NA NHANES 2005-2006 Public Release 32962 35316
Code
bmx_q %>%
  select(where(is.numeric), matches("^BMX")) %>%
  kable_table(n = 6)
SEQN BMXWT BMXRECUM BMXHEAD BMIHEAD BMXHT BMXBMI BMXLEG BMXCALF BMXARML BMXARMC BMXWAIST BMXTHICR BMXTRI BMXSUB
31127 10 74 NA NA NA NA NA NA 16 16 NA NA 13 10.0
31128 40 NA NA NA 152 17 38 29 34 22 63 40 10 8.4
31129 75 NA NA NA 168 27 43 41 36 33 98 56 19 17.6
31130 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
31131 75 NA NA NA 156 31 38 37 35 36 96 54 NA NA
31132 70 NA NA NA 168 25 40 36 38 31 96 48 10 22.2

You can also drop columns with a leading - inside select():

Code
demo_q %>%
  select(-starts_with("RID")) %>%
  kable_table(n = 6)
SEQN SDDSRVYR RIAGENDR DMQMILIT DMDBORN DMDCITZN DMDYRSUS DMDEDUC3 DMDEDUC2 DMDSCHOL DMDMARTL DMDHHSIZ DMDFMSIZ INDHHINC INDFMINC INDFMPIR DMDHRGND DMDHRAGE DMDHRBRN DMDHREDU DMDHRMAR DMDHSEDU SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANG WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA
31127 NHANES 2005-2006 Public Release Male NA Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA NA NA NA 4 4 $15,000 to $19,999 $15,000 to $19,999 0.75 Female 21 Born in 50 US States or Washington, DC High School Grad/GED or equivalent Married 9-11th Grade (Includes 12th grade with no diploma) English Yes No English No No NA NA NA NA 6435 6571 2 44
31128 NHANES 2005-2006 Public Release Female NA Born in 50 US States or Washington, DC Citizen by birth or naturalization NA 4th Grade NA In school NA 7 6 $45,000 to $54,999 $20,000 to $24,999 0.77 Male 47 Born in 50 US States or Washington, DC 9-11th Grade (Includes 12th grade with no diploma) NA NA English Yes No English No No English No No English 9082 8987 1 52
31129 NHANES 2005-2006 Public Release Male NA Born in 50 US States or Washington, DC Citizen by birth or naturalization NA 10th Grade NA In school Never married 6 6 $65,000 to $74,999 $65,000 to $74,999 2.71 Male 41 Born in 50 US States or Washington, DC Some College or AA degree Married Some College or AA degree English Yes No English No No English No No English 5317 5587 1 51
31130 NHANES 2005-2006 Public Release Female No Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA Some College or AA degree NA Widowed 1 1 $15,000 to $19,999 $15,000 to $19,999 1.99 Female 85 Born in 50 US States or Washington, DC Some College or AA degree Widowed NA English No No English No No NA NA NA NA 29961 34031 2 46
31131 NHANES 2005-2006 Public Release Female No Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA Some College or AA degree NA Married 4 4 $75,000 and Over $75,000 and Over 4.65 Male 36 Born in 50 US States or Washington, DC College Graduate or above Married Some College or AA degree English No No English No No English No No English 26458 26771 1 48
31132 NHANES 2005-2006 Public Release Male Yes Born in 50 US States or Washington, DC Citizen by birth or naturalization NA NA College Graduate or above NA Married 2 2 $75,000 and Over $75,000 and Over 5.00 Male 70 Born in 50 US States or Washington, DC College Graduate or above Married College Graduate or above English No No English No No English No No English 32962 35316 2 52

Conditional transforms: create variables with if_else() and case_when()

Code
demo_q %>%
  mutate(
    adult = if_else(RIDAGEYR >= 18, "adult", "minor"),
    age_group = case_when(
      RIDAGEYR < 18 ~ "child",
      RIDAGEYR < 65 ~ "adult",
      TRUE ~ "older adult"
    )
  ) %>%
  select(SEQN, RIDAGEYR, age_group) %>%
  kable_table(n = 6)
SEQN RIDAGEYR age_group
31127 0 child
31128 11 child
31129 15 child
31130 85 older adult
31131 44 adult
31132 70 older adult

Mutate across: apply the same transformation to multiple columns

Code
bmx_q %>%
  mutate(
    across(
      c(BMXWT, BMXARML, BMXTHICR),
      ~ .x - mean(.x, na.rm = TRUE),
      .names = "centered_{.col}"
    )
  ) %>%
  dplyr::select(SEQN, starts_with("centered_")) %>%
  kable_table(n = 6)
SEQN centered_BMXWT centered_BMXARML centered_BMXTHICR
31127 -49.8 -16.7 NA
31128 -19.9 1.6 -11.8
31129 14.6 3.8 4.6
31130 NA NA NA
31131 15.2 2.3 2.4
31132 9.5 4.8 -3.3

Groups: summarize or mutate within groups

group_by() can use multiple keys (e.g., group_by(BTN, COMPANY)) and ungroup() drops grouping when you’re done.

Code
demo_q %>%
  group_by(RIAGENDR) %>%
  summarize(
    n = n(),
    avg_age = mean(RIDAGEYR, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  kable_table()
RIAGENDR n avg_age
Male 5080 28
Female 5268 28

Grouping persists across verbs until you ungroup() (or use .groups in summarize()):

Code
demo_q %>%
  group_by(RIAGENDR) %>%
  mutate(mean_age = mean(RIDAGEYR, na.rm = TRUE)) %>%
  ungroup() %>%
  summarize(overall_mean = mean(RIDAGEYR, na.rm = TRUE)) %>%
  kable_table()
overall_mean
28

Quick counting helpers:

Code
demo_q %>%
  count(RIAGENDR)                      # counts per gender
RIAGENDR n
Male 5080
Female 5268
Code
demo_q %>%
  summarize(n_edu = n_distinct(DMDHREDU))
n_edu
8

For single summaries, you can also use .by instead of group_by():

Code
demo_q %>%
  summarize(
    avg_age = mean(RIDAGEYR, na.rm = TRUE),
    .by = RIAGENDR # compute summary within gender without keeping groups
  )
RIAGENDR avg_age
Male 28
Female 28