Data wrangling in dplyr

Author

Michael Hallquist, PSYC 859

Published

March 12, 2026

The goal of this document is to provide a basic introduction to data wrangling using functions from the so-called ‘tidyverse’ approach. The tidyverse (https://www.tidyverse.org) is a set of data science packages in R that are intended to provide a consistent paradigm for working with data. This approach unifies a previously inchoate landscape of different functions and packages in R that could be daunting to new users.

Although I do not claim that the tidyverse approach is best according to all possible criteria, I believe that it is the best paradigm for working with data in R for social scientists, many of whom do not have a formal background in computer programming.

Here, I will draw primarily from the tidyr and dplyr packages in R.

For an excellent book-length treatment of the tidyverse approach, see R for Data Science (2nd edition) by Hadley Wickham, Mine Cetinkaya-Rundel, and Garrett Grolemund.

Before we start: beware namespace collisions!

One of the most irritating problems you may encounter in the tidyverse world (and more generally, in R) is when code that previously worked suddenly throws an inexplicable error.

For example:

> survey %>% group_by(R_exp) %>% 
summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))

Error in summarize(., m_age = mean(Psych_age_yrs), sd_age = sd(Psych_age_yrs)) : 
argument "by" is missing, with no default

By using fairly intuitive data wrangling verbs such as ‘summarize’ and ‘select’, dplyr (and sometimes tidyr) sometimes uses the same function names as other packages. For example, Hmisc has a summarize function that does not operate in the same way as summarize in dplyr. Also, the predecessor to dplyr was called plyr. Although largely outmoded, it has a few remaining functions that may be useful. But… many of these functions have the same names in dplyr but operate differently (the syntax is not the same!), which can be a common source of collisions when using dplyr.

This points to the problem of what are called ‘namespace collisions.’ That is, when R looks for a function (or any object) in the global environment, it searches through a ‘path’. You can see the nitty gritty using searchpaths(). But the TL;DR is that if you – or any function you call on – loads another package, that package may override a dplyr function and make your code crash!

Addressing namespace collisions

Watch out for warnings about objects being ‘masked’ when packages are loaded.
Explicitly specify the package where your desired function lives using the double colon operator. Example: dplyr::summarize.
Try to load tidyverse packages using library(tidyverse). This handles collisions within the tidyverse!

For example, in this document, multilevel pulls in MASS, which masks dplyr::select. If we were to load dplyr first, then multilevel, we would see the following:

> library(dplyr)
> library(multilevel)
Loading required package: nlme

Attaching package: ‘nlme’

The following object is masked from ‘package:dplyr’:

    collapse

Loading required package: MASS

Attaching package: ‘MASS’

The following object is masked from ‘package:dplyr’:

    select

Read in English, this says, “The multilevel package needs to load nlme (a mixed-effects package). Nlme has a function called ‘collapse’ that will now take precedence over ‘collapse’ in dplyr if you run collapse() in your session. Multilevel also needs the MASS package, so I’m loading it. The MASS package has a function called ‘select’ that now takes precedence over dplyr’s select function, so if you call select() in your session, you will be getting the MASS package function, not the dplyr one.”

To get around these problems, the easiest route is to load dplyr (and other tidyverse packages) last in the chain, giving them precedence over other packages. As you see above in the p_load call, we load dplyr last in the chain so that its select function is preferred over MASS. This is important because functions of the same name across packages may have completely different purposes and syntax!

The slightly harder, but more robust, route is always to use explicit namespace qualifiers with the :: operator. If you use dplyr::select as the function call, you are explicitly telling R that you want dplyr’s select function, not some other package’s.

The two most common collisions are with select (MASS) and summarize (Hmisc). If you’re not sure which version of a function has taken precedence you can type the function name without parentheses. This will print the function source (which you can ignore). But take a look at the last line, which reveals where R ‘found’ the function in the search across loaded packages (based on their load order/precedence).

> select
function (obj) 
UseMethod("select")
<bytecode: 0x93b18ebd8>
<environment: namespace:MASS>

If you were expecting that last line to read <environment: namespace:dplyr> (i.e. dplyr’s version), then you have a namespace problem to resolve!

Data pipelines

The tidyverse paradigm encourages the use of so-called data pipelines when writing the syntax for a multi-step data transformation procedure. The pipe operator %>% is provided by the magrittr package, which is loaded by dplyr. Data pipeline syntax is intended to provide a readable syntax for the order in which data operations are performed. You can think of this as a recipe for data wrangling. For example:

1. Read data from the file: 'mydata.csv'
2. Rename SDATE to submission_date; rename ed to education
3. Only keep subject IDs above 10 (the first 9 were in-lab pilot subjects)
4. Drop the timestamp and survey_settings variables
5. Compute the log of income as a new variable called log_income
6. Arrange the data by ID, then submission_date (where each subject submitted many surveys)
7. Ensure that ID and submission_date are the left-most columns in the data

We would write this out as a dplyr pipeline using the pipe operator %>% to chain together data operations.

dataset <- read.csv("mydata.csv") %>%
  rename(submission_date=SDATE, education=ed) %>%
  filter(ID < 10) %>%
  select(-timestamp, -survey_settings) %>%
  mutate(log_income = log(income)) %>%
  arrange(ID, submission_date) %>%
  select(ID, submission_date, everything())

Use of ‘this’ reference in tidyverse

Sometimes it is useful to refer to the current dataset or variable explicitly in tidyverse data wrangling syntax.

dplyr/magrittr tends to hide this from us for convenience, but it’s there under the hood.

iris %>% filter(Sepal.Length > 7)

is the same as

iris %>% filter(., Sepal.Length > 7)

So, '.' refers to the current dataset or variable (depending on context) in dplyr operations. And if you don’t specify where the '.' falls in your syntax, it will always be passed as the first argument to the downstream function.

Special values to watch out for in data wrangling

NA: missing
- na.rm=TRUE available in many functions
- Also, see na.omit(), na.exclude(), na.fail(), na.pass()
NULL: null set
- Often used when something is undefined
Inf: infinite
NaN: Not a number.
- Result of an invalid computation, e.g., log(-1)
warnings(): If R mentions a warning in data wrangling, make sure you’ve handled it or know its origin. Don’t just ignore these!

dplyr fundamentals

The dplyr package is at the core of data wrangling in the tidyverse, providing a set of wrangling verbs that collectively form a coherent language for data transformations.

Overview: a first pass through an NHANES dataset

To learn dplyr, let’s start with a survey from the National Health and Nutrition Examination Survey (NHANES) dataset. These data are provided in the nhanesA package. We’ll start by looking at a couple of basic demographic variables.

NHANES variables used below (from DEMO_D and BMX_D):

SEQN: respondent ID
RIAGENDR: gender (coded numeric in raw NHANES)
RIDAGEYR: age in years
DMDHREDU: education level
BMXWT: weight (kg)
BMXARML: upper arm length (cm)
BMXTHICR: thigh circumference (cm)

Code

# demographics: note that this code relies on functions in the nhanesA package, which
#   is useful for working with NHANES data, but is not a focus of this workshop per se.

demo_d <- nhanes('DEMO_D')
demo_d_vars  <- nhanesTableVars('DEMOGRAPHICS', 'DEMO_D', namesonly=TRUE)

# translate numeric codes into factors by using the nhanes lookup/codebook functions
demo_d <- suppressWarnings(nhanesTranslate('DEMO_D', demo_d_vars, data=demo_d))

Translated columns: AIALANG DMDBORN DMDCITZN DMDEDUC2 DMDEDUC3 DMDHRBRN DMDHREDU DMDHRGND DMDHRMAR DMDHSEDU DMDMARTL DMDSCHOL DMDYRSUS DMQMILIT FIAINTRP FIALANG FIAPROXY INDFMINC INDHHINC MIAINTRP MIALANG MIAPROXY RIAGENDR RIDEXMON RIDEXPRG RIDRETH1 RIDSTATR SDDSRVYR SIAINTRP SIALANG SIAPROXY

Code

# dplyr pipeline
demo_d <- demo_d %>% 
  filter(!INDHHINC %in% c("Over $20,000", "Under $20,000", "Refused", "Don't know")) %>% 
  droplevels() %>% # drop unused factor levels from the data.frame
  mutate(
    # case_when() is a vectorized if/else ladder for many conditions
    income_num = case_when( # convert range-based factor labels to midpoint numbers
    INDHHINC == "$     0 to $ 4,999" ~ 2500,
    INDHHINC == "$ 5,000 to $ 9,999" ~ 7500,
    INDHHINC == "$10,000 to $14,999" ~ 12500,
    INDHHINC == "$15,000 to $19,999" ~ 17500,
    INDHHINC == "$20,000 to $24,999" ~ 22500,
    INDHHINC == "$25,000 to $34,999" ~ 30000,
    INDHHINC == "$35,000 to $44,999" ~ 40000,
    INDHHINC == "$45,000 to $54,999" ~ 50000,
    INDHHINC == "$55,000 to $64,999" ~ 60000,
    INDHHINC == "$65,000 to $74,999" ~ 70000,
    INDHHINC == "$75,000 and Over" ~ 80000
    )
  )
 
#load body (biometric) measures
bmx_d <- nhanes('BMX_D')
bmx_d_vars  <- nhanesTableVars('EXAM', 'BMX_D', namesonly=TRUE)
bmx_d <- suppressWarnings(nhanesTranslate('BMX_D', bmx_d_vars, data=bmx_d))

Translated columns: BMDSTATS BMIARMC BMIARML BMICALF BMIHT BMILEG BMIRECUM BMISUB BMITHICR BMITRI BMIWAIST BMIWT

Code

# merge the education demographics variable with the biometric data, joining on the SEQN column (basically an ID)
bmx_d <- demo_d %>% 
  dplyr::select(SEQN, DMDHREDU) %>% 
  inner_join(bmx_d, by="SEQN")

group_by + summarize

Let’s summarize the mean household income (income_num), converted from categories in INDHHINC) by highest level of education completed (DMDHREDU)

Code

demo_d %>% 
  group_by(DMDHREDU) %>% # divide dataset into separate compartments by education level
  dplyr::summarize(
    n=n(), # number of observations
    m_income=mean(income_num, na.rm=T), 
    sd_income=sd(income_num, na.rm=T)
  ) %>% 
  kable_table()

DMDHREDU	n	m_income	sd_income
Less Than 9th Grade	1173	28024	18489
9-11th Grade (Includes 12th grade with no diploma)	1626	31318	21584
High School Grad/GED or equivalent	2317	39400	23422
Some College or AA degree	2664	48137	25326
College Graduate or above	1768	63785	21816
Refused	6	25000	12624
Don't know	26	20900	15760
NA	276	44623	24849

Note that summarize removes a single level of grouping in the group_by process. Here, we only have one grouping variable, DMDHREDU, so the output of summarize will be ‘ungrouped.’

Grouped summaries of several variables

What if I want to have means and SDs for several continuous variables grouped by highest education? Let’s look specifically at the weight (BMXWT), upper arm length (BMXARML), and thigh circumference (BMXTHICR) measurements. The combination of summarize and across provide functionality to specify several variables using the .cols argument of across and potentially several summary functions by passing them in a named list.

Code

bmx_d %>% 
  group_by(DMDHREDU) %>% 
  dplyr::summarize(
    across(
      c(BMXWT, BMXARML, BMXTHICR), 
      list(m=~mean(.x, na.rm=T), sd=~sd(.x, na.rm=T))
    )
  ) %>% 
  kable_table()

DMDHREDU	BMXWT_m	BMXWT_sd	BMXARML_m	BMXARML_sd	BMXTHICR_m	BMXTHICR_sd
Less Than 9th Grade	58	30	32	7.8	50	7.48
9-11th Grade (Includes 12th grade with no diploma)	56	33	32	8.2	51	8.40
High School Grad/GED or equivalent	61	33	33	7.9	52	8.10
Some College or AA degree	62	32	33	7.6	52	8.65
College Graduate or above	61	31	33	7.8	52	7.99
Refused	70	25	38	2.5	50	0.57
Don't know	50	36	28	11.5	53	7.74
NA	56	29	32	7.6	51	7.73

Let’s slow this down:

group_by verb

survey %>% group_by(DMDHREDU)

This tells dplyr to divide the bmx_d (NHANES biometric measurements) data into a set of smaller data.frame objects, one per level of DMDHREDU. Internally, this looks something like the output below. After this division of the dataset into chunks, summarize will work on each chunk individually.

$`Less Than 9th Grade`
   SEQN            DMDHREDU                    BMDSTATS BMXWT
1 31137 Less Than 9th Grade          Other partial exam    80
2 31157 Less Than 9th Grade Complete data for age group    42
3 31169 Less Than 9th Grade Complete data for age group    22
4 31175 Less Than 9th Grade Complete data for age group    86

$`9-11th Grade (Includes 12th grade with no diploma)`
   SEQN                                           DMDHREDU
1 31128 9-11th Grade (Includes 12th grade with no diploma)
2 31133 9-11th Grade (Includes 12th grade with no diploma)
3 31145 9-11th Grade (Includes 12th grade with no diploma)
4 31148 9-11th Grade (Includes 12th grade with no diploma)
                     BMDSTATS BMXWT
1 Complete data for age group    40
2 Complete data for age group    45
3 Complete data for age group    40
4 Complete data for age group    52

$`High School Grad/GED or equivalent`
   SEQN                           DMDHREDU                    BMDSTATS BMXWT
1 31127 High School Grad/GED or equivalent Complete data for age group    10
2 31139 High School Grad/GED or equivalent          Other partial exam    74
3 31140 High School Grad/GED or equivalent Complete data for age group    42
4 31143 High School Grad/GED or equivalent Complete data for age group    76

$`Some College or AA degree`
   SEQN                  DMDHREDU                    BMDSTATS BMXWT
1 31129 Some College or AA degree Complete data for age group    75
2 31130 Some College or AA degree  No body measures exam data    NA
3 31138 Some College or AA degree Complete data for age group    14
4 31142 Some College or AA degree          Other partial exam    80

$`College Graduate or above`
   SEQN                  DMDHREDU                    BMDSTATS BMXWT
1 31131 College Graduate or above          Other partial exam    75
2 31132 College Graduate or above Complete data for age group    70
3 31135 College Graduate or above Complete data for age group    10
4 31141 College Graduate or above Complete data for age group    60

$Refused
   SEQN DMDHREDU                             BMDSTATS BMXWT
1 33385  Refused          Complete data for age group    78
2 35296  Refused Partial:  Height and weight obtained    38
3 37761  Refused          Complete data for age group    67
4 40762  Refused Partial:  Height and weight obtained    99

$`Don't know`
   SEQN   DMDHREDU                    BMDSTATS BMXWT
1 31546 Don't know          Other partial exam    78
2 32106 Don't know Complete data for age group    12
3 32626 Don't know Complete data for age group    64
4 32648 Don't know Complete data for age group    85

summarize + across

The summarize function transforms a dataset that has many rows to a dataset that has a single row per grouping unit. If you do not use group_by, summarize will yield an overall summary statistic in the entire dataset. For example, to get the mean and SD of household income in NHANES, irrespective of education level or other categorical moderators, we could just use a simple summarize:

Code

demo_d %>%
  dplyr::summarize(
    m_income=mean(income_num, na.rm=T), 
    sd_income=sd(income_num, na.rm=T)
  ) %>% 
  kable_table()

m_income	sd_income
43590	25726

But because we used group_by(DMDHREDU) above, we got unique summaries of the variables at each level of education.

The across function accepts two primary arguments. First, we specify a set of variables (the .cols argument) that we wish to summarize in the same way (i.e., compute the same summary statistics). Second, we specify which statistics we wish to compute (the .fns argument). In our case, the syntax was:

dplyr::summarize(
    across(c(BMXWT, BMXARML, BMXTHICR), 
           list(m=~mean(.x, na.rm=T), sd=~sd(.x, na.rm=T)))
)

The c() function specifies a vector of unquoted variable names in the dataset we wish to summarize, separated by commas. Note that any tidyselect operator for selecting columns will work. This includes starts_with, ends_with, matches, and others. For details, see ?dplyr_tidy_select.

The list() here asks dplyr to run each function in the list against each variables in the .cols specification. Here, this means that dplyr will compute the mean and SD of each variable in the .cols argument at each level of education completed (the group_by basis). The names of the list elements (left side) — here, m and sd — become the suffixes added for each variable. The value of the element (right side) — here, mean and sd — are the functions that should be used to compute a summary statistic (they should return one number per grouped variable).

Note that the column names that result from an across operation can be modified using the .names argument. The default is to suffix the variable name with the names of the functions list, preceded by an underscore (e.g., income_m).

Passing arguments to summary functions

Notice how we passed na.rm=TRUE to the mean function within the list. This tells the mean to ignore missing (NA) values when computing the mean (i.e., mean of the non-missing numbers). In general, the dplyr syntax using what they call “lambdas” (starting with ~) is the clearest way to control the arguments passed to each function. Here is a simple definition of a lambda in R:

~ mean(.x, na.rm=TRUE)

The .x refers to the current variable being used within a dplyr data wrangling operation. This is in contrast to ., which generally refers to the current dataset.

If you don’t need to pass arguments to the functions in an across() operation, you can just state the function name:

dplyr::summarize(
    across(c(BMXWT, BMXARML, BMXTHICR), 
           list(m=mean, sd=sd))
)

Making a summarize pipeline even more beautiful

We can also make the output more beautiful using tidying techniques we’ve already seen in the tidyr tutorial. Remember that R is all about programming for data science. In particular, notice that we have some columns that are means and others that are SDs.

We can just extend our data pipeline a bit. The extract function from tidyr here is like separate, but with a bit more oomph using regular expressions. This is a more intermediate topic, but there is a useful tutorial here: http://www.regular-expressions.info/tutorial.html.

Code

bmx_d %>% 
  group_by(DMDHREDU) %>% 
  dplyr::summarize(
    across(
      c(BMXWT, BMXARML, BMXTHICR), 
      list(m=~mean(.x, na.rm=T), sd=~sd(.x, na.rm=T))
    )
  ) %>%
  
   # combine m and sd statistics (notice how you can add comments inline within a pipeline?)
  pivot_longer(cols=-DMDHREDU, names_to = "Measure", values_to = "value") %>%
  
  # divide income_m into income and m
  #extract(col=Measure, into=c("bio_measure", "statistic"), regex=("(.*)_(.*)$")) %>% 
  separate(col=Measure, into=c("bio_measure", "statistic"), sep="_") %>% 
  pivot_wider(names_from=statistic, values_from = value) %>% 
  arrange (bio_measure, DMDHREDU) %>%
  kable_table()

DMDHREDU	bio_measure	m	sd
Less Than 9th Grade	BMXARML	32	7.77
9-11th Grade (Includes 12th grade with no diploma)	BMXARML	32	8.22
High School Grad/GED or equivalent	BMXARML	33	7.86
Some College or AA degree	BMXARML	33	7.61
College Graduate or above	BMXARML	33	7.81
Refused	BMXARML	38	2.47
Don't know	BMXARML	28	11.49
NA	BMXARML	32	7.62
Less Than 9th Grade	BMXTHICR	50	7.48
9-11th Grade (Includes 12th grade with no diploma)	BMXTHICR	51	8.40
High School Grad/GED or equivalent	BMXTHICR	52	8.10
Some College or AA degree	BMXTHICR	52	8.65
College Graduate or above	BMXTHICR	52	7.99
Refused	BMXTHICR	50	0.57
Don't know	BMXTHICR	53	7.74
NA	BMXTHICR	51	7.73
Less Than 9th Grade	BMXWT	58	30.15
9-11th Grade (Includes 12th grade with no diploma)	BMXWT	56	33.44
High School Grad/GED or equivalent	BMXWT	61	32.70
Some College or AA degree	BMXWT	62	32.48
College Graduate or above	BMXWT	61	31.30
Refused	BMXWT	70	25.08
Don't know	BMXWT	50	36.00
NA	BMXWT	56	29.12

arrange: order observations

Toward the end of the pipeline above, we see:

arrange (bio_measure, DMDHREDU) %>%

The arrange verb in dplyr requests that observations be sorted according to one or more variables. Here, we ask for the dataset to be sorted by bio_measure (biometric measure, such as weight) first, then by education level within that measure. The arrange verb sorts observations in ascending order (low to high) by default, but data can be sorted in descending order using the desc() function:

arrange(bio_measure, desc(DMDHREDU)) %>%

This would sort by highest to lowest education level within each measure (bio_measure), where the bio_measures are still in ascending (alphabetical) order

Examining a more complex multilevel dataset using dplyr

Let’s examine the univbct data, which contains longitudinal observations of job satisfaction, commitment, and readiness to deploy. From the documentation ?univbct:

This data set contains the complete data set used in Bliese and Ployhart (2002). The data is longitudinal data converted to univariate (i.e., stacked) form. Data were collected at three time points. A data frame with 22 columns and 1485 observations from 495 individuals.

We have 1485 observations of military personnel nested within companies, which are nested within batallions: https://en.wikipedia.org/wiki/Battalion.

Code

data(univbct, package="multilevel")
str(univbct)

'data.frame':   1485 obs. of  22 variables:
 $ BTN    : num  1022 1022 1022 1004 1004 ...
 $ COMPANY: Factor w/ 8 levels "A","B","C","D",..: 6 6 6 4 4 4 2 2 2 2 ...
 $ MARITAL: num  1 1 1 4 4 4 2 2 2 2 ...
 $ GENDER : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HOWLONG: num  2 2 2 0 0 0 0 0 0 1 ...
 $ RANK   : num  12 12 12 13 13 13 15 15 15 14 ...
 $ EDUCATE: num  2 2 2 2 2 2 2 2 2 2 ...
 $ AGE    : num  20 20 20 24 24 24 24 24 24 23 ...
 $ JOBSAT1: num  1.67 1.67 1.67 3.67 3.67 ...
 $ COMMIT1: num  1.67 1.67 1.67 1.67 1.67 ...
 $ READY1 : num  2.75 2.75 2.75 3 3 3 3.75 3.75 3.75 2.5 ...
 $ JOBSAT2: num  1 1 1 4 4 ...
 $ COMMIT2: num  1.67 1.67 1.67 1.33 1.33 ...
 $ READY2 : num  1 1 1 2 2 2 3.75 3.75 3.75 3.25 ...
 $ JOBSAT3: num  3 3 3 4 4 4 4 4 4 3 ...
 $ COMMIT3: num  3 3 3 1.33 1.33 ...
 $ READY3 : num  3 3 3 1.75 1.75 1.75 1.75 1.75 1.75 3 ...
 $ TIME   : num  0 1 2 0 1 2 0 1 2 0 ...
 $ JSAT   : num  1.67 1 3 3.67 4 ...
 $ COMMIT : num  1.67 1.67 3 1.67 1.33 ...
 $ READY  : num  2.75 1 3 3 2 1.75 3.75 3.75 1.75 2.5 ...
 $ SUBNUM : num  1 1 1 2 2 2 3 3 3 4 ...

Let’s enact the core ‘verbs’ of dplyr to understand and improve the structure of these data.

filter: obtaining observations (rows) based on some criteria

Objective: Retain only men in company A

Code

COMPANY="A"
COMPANY <- "A"

company_A_men <- filter(univbct, COMPANY=="A" & GENDER==1)
#print 10 observations at random to check the accuracy of the filter
#p=11 just shows the first 11 columns to keep it on one page for formatting
company_A_men %>% sample_n(10) %>% kable_table(p=11)

	BTN	COMPANY	MARITAL	GENDER	HOWLONG	RANK	EDUCATE	AGE	JOBSAT1	COMMIT1	READY1
232	1022	A	2	1	2	15	2	29	3.7	3.3	3.2
1373	404	A	2	1	1	16	4	36	4.0	4.0	3.2
1395	3066	A	2	1	2	15	2	26	4.0	4.0	3.8
1235	3066	A	2	1	2	16	2	34	3.0	3.7	3.0
1420	3066	A	2	1	3	13	3	21	3.0	3.7	2.8
1208	4	A	2	1	3	15	2	28	2.3	2.7	1.8
233	1022	A	2	1	2	15	2	29	3.7	3.3	3.2
1159	299	A	2	1	3	14	2	24	2.7	3.0	2.5
445	1022	A	1	1	1	14	2	23	3.7	3.3	3.5
616	4042	A	1	1	2	13	2	19	1.3	3.3	2.5

Objective: Count how many people are in companies A and B

Code

filter(univbct, COMPANY %in% c("A","B")) %>% nrow()

[1] 750

Objective: What about counts by company and battalion?

Code

univbct %>% 
  group_by(BTN, COMPANY) %>% 
  tally() %>%
  kable_table(n=12)

BTN	COMPANY	n
4	A	66
4	B	15
4	C	12
4	D	30
4	HHC	18
104	A	12
104	HHC	3
124	A	42
144	A	30
299	A	39
299	B	30
299	C	27

Code

# N.B. The same result could be obtained with count(BTN, COMPANY) alone.
#  This combines the group_by and tally functions

select: choose variables (columns) based on some criteria

Let’s start by keeping only the three core dependent variables over time: jobsat, commit, ready. Keep SUBNUM as well for unique identification.

Code

dvs_only <- univbct %>% 
  dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, 
                COMMIT1, COMMIT2, COMMIT3, 
                READY1, READY2, READY3)

If you have many variables of a similar name, you might try starts_with(). Note in this case that it brings in “READY”, too. Note that you can mix different selection mechanisms within select. Look at the cheatsheet.

Code

dvs_only <- univbct %>% 
  dplyr::select(SUBNUM, starts_with("JOBSAT"), starts_with("COMMIT"), starts_with("READY"))

Other selection mechanisms:

contains: variable name contains a literal string
starts_with: variable names start with a string
ends_with: variable names end with a string
num_range: variables that have a common prefix (e.g., ‘reasoning’) and a numeric range (e.g., 1-20)
matches: variable name matches a regular expression
one_of: variable is one of the elements in a character vector. Example: select(one_of(c(“A”, “B”)))

See ?select_helpers for more details.

select + filter: zooming in on specific observations and variables

Note that select and filter can be combined to subset both observations and variables of interest.

For example, look at readiness to deploy in battalion 299 only:

Code

univbct %>% 
  filter(BTN==299) %>% 
  dplyr::select(SUBNUM, READY1, READY2, READY3) %>% 
  kable_table(n=6)

	SUBNUM	READY1	READY2	READY3
10	4	2.5	3.2	3.0
11	4	2.5	3.2	3.0
12	4	2.5	3.2	3.0
19	7	2.0	1.8	1.2
20	7	2.0	1.8	1.2
21	7	2.0	1.8	1.2

select is also useful for dropping variables that are not of interest using a kind of subtraction syntax.

Code

nojobsat <- univbct %>% 
  dplyr::select(-starts_with("JOBSAT"))
names(nojobsat)

 [1] "BTN"     "COMPANY" "MARITAL" "GENDER"  "HOWLONG" "RANK"    "EDUCATE"
 [8] "AGE"     "COMMIT1" "READY1"  "COMMIT2" "READY2"  "COMMIT3" "READY3" 
[15] "TIME"    "JSAT"    "COMMIT"  "READY"   "SUBNUM"

mutate: add one or more variables that are a function of other variables

(Row-wise) mean of commit scores over waves. Note how you can used select() within a mutate to run a function on a subset of the data.

Code

univbct <- univbct %>% 
  mutate(commitmean=rowMeans(dplyr::select(., COMMIT1, COMMIT2, COMMIT3)))

Mutate can manipulate several variables in one call. Here, mean center any variable that starts with COMMIT and add the suffix _cm for clarity. Also compute the percentile rank for each of these columns, with _pct as suffix. Note the use of the starts_with function here within the across(). This operates identically to select, but in the context of a summary or mutation operation on specific variables. See ?select_helpers for details.

Code

meancent <- function(x) { x - mean(x, na.rm=TRUE) } #simple worker function to mean center a variable

univbct <- univbct %>% 
  mutate(across(starts_with("COMMIT", ignore.case = FALSE), list(cm=meancent, pct=percent_rank)))

univbct %>%
  dplyr::select(starts_with("COMMIT", ignore.case = FALSE)) %>%
  kable_table(n=8) %>% kable_styling(font_size = 12)

COMMIT1	COMMIT2	COMMIT3	COMMIT	COMMIT1_cm	COMMIT1_pct	COMMIT2_cm	COMMIT2_pct	COMMIT3_cm	COMMIT3_pct	COMMIT_cm	COMMIT_pct
1.7	1.7	3.0	1.7	-1.95	0.01	-1.80	0.04	-0.54	0.12	-1.87	0.02
1.7	1.7	3.0	1.7	-1.95	0.01	-1.80	0.04	-0.54	0.12	-1.87	0.02
1.7	1.7	3.0	3.0	-1.95	0.01	-1.80	0.04	-0.54	0.12	-0.54	0.15
1.7	1.3	1.3	1.7	-1.95	0.01	-2.13	0.03	-2.20	0.01	-1.87	0.02
1.7	1.3	1.3	1.3	-1.95	0.01	-2.13	0.03	-2.20	0.01	-2.21	0.01
1.7	1.3	1.3	1.3	-1.95	0.01	-2.13	0.03	-2.20	0.01	-2.21	0.01
3.3	3.3	3.7	3.3	-0.28	0.27	-0.13	0.33	0.13	0.45	-0.21	0.30
3.3	3.3	3.7	3.3	-0.28	0.27	-0.13	0.33	0.13	0.45	-0.21	0.30

arrange: reorder observations in specific order

Order data by ascending battalion, company, then subnum

Code

univbct <- univbct %>% 
  arrange(BTN, COMPANY, SUBNUM)

Descending sort: descending battalion, ascending company, ascending subnum

Code

univbct <- univbct %>% 
  arrange(desc(BTN), COMPANY, SUBNUM)

A more realistic example: preparation for multilevel analysis

In MLM, one strategy for disentangling within- versus between-person effects is to include both within-person-centered variables and person means in the model (Curran & Bauer, 2011).

We can achieve this easily for our three DVs here using a single pipeline that combines tidying and mutation. Using -1 as the sep argument to separate splits the string at the second-to-last position (i.e., starting at the right).

For reshaping to work smoothly, we need a unique identifier for each row. Also, univbct is stored in a dangerously untidy format in which variables with suffix 1-3 indicate a ‘wide format’, but the data is also in long format under variables such as ‘JSAT’ and ‘COMMIT.’ In other words, there is a peculiar redundancy in the data that is altogether confusing.

Take a look:

Code

univbct %>%
  dplyr::select(SUBNUM, starts_with("JOBSAT"), JSAT) %>% 
  kable_table(n=12)

	SUBNUM	JOBSAT1	JOBSAT2	JOBSAT3	JSAT
319	103	2.0	2.3	3.3	2.0
320	103	2.0	2.3	3.3	2.3
321	103	2.0	2.3	3.3	3.3
397	129	3.7	4.3	4.7	3.7
398	129	3.7	4.3	4.7	4.3
399	129	3.7	4.3	4.7	4.7
523	171	3.7	4.0	NA	3.7
524	171	3.7	4.0	NA	4.0
525	171	3.7	4.0	NA	NA
616	202	1.3	2.0	4.3	1.3
617	202	1.3	2.0	4.3	2.0
618	202	1.3	2.0	4.3	4.3

We first need to eliminate this insanity. Group by subject number and retain only the first row (i.e., keep the wide version).

Code

univbct <- univbct %>% 
  group_by(SUBNUM) %>% # split into separate groups for each subject
  filter(row_number() == 1) %>% # only retain the first row of each subject
  dplyr::select(-JSAT, -COMMIT, -READY) %>% # drop redundant columns
  ungroup() # remove grouping from data structure (we are done with group-based wrangling)

First, let’s get the data into a conventional format (long) for MLM (e.g., using lmer)

Code

forMLM <- univbct %>% 
  dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, 
                COMMIT1, COMMIT2, COMMIT3, 
                READY1, READY2, READY3) %>% 
  
  # pivot everything but SUBNUM
  pivot_longer(names_to = "key", values_to = "value", cols=-SUBNUM) %>%
  
  # -1 splits at the last character of the variable name
  separate(col="key", into=c("variable", "occasion"), -1, convert=TRUE) %>%
  pivot_wider(names_from = variable, values_from = value) #%>% 
  #mutate(occasion=as.integer(occasion))

Now, let’s perform the centering described above. You could do this in one pipeline – I just separated things here for conceptual clarity.

Code

forMLM <- forMLM %>% group_by(SUBNUM) %>% 
  mutate(across(c(COMMIT, JOBSAT, READY), list(wic=meancent, pm=mean))) %>%
  ungroup()

forMLM %>% kable_table(n=10) %>% kable_styling(font_size = 14)

SUBNUM	occasion	JOBSAT	COMMIT	READY	COMMIT_wic	COMMIT_pm	JOBSAT_wic	JOBSAT_pm	READY_wic	READY_pm
103	1	2.0	3.7	4.0	0.00	3.7	-0.56	2.6	1.25	2.8
103	2	2.3	3.7	2.0	0.00	3.7	-0.22	2.6	-0.75	2.8
103	3	3.3	3.7	2.2	0.00	3.7	0.78	2.6	-0.50	2.8
129	1	3.7	5.0	2.5	0.44	4.6	-0.56	4.2	-0.33	2.8
129	2	4.3	4.3	2.8	-0.22	4.6	0.11	4.2	-0.08	2.8
129	3	4.7	4.3	3.2	-0.22	4.6	0.44	4.2	0.42	2.8
171	1	3.7	4.0	3.2	-0.17	NA	-0.17	NA	-0.12	NA
171	2	4.0	4.3	3.5	0.17	NA	0.17	NA	0.12	NA
171	3	NA	NA	NA	NA	NA	NA	NA	NA	NA
202	1	1.3	3.3	2.5	0.44	2.9	-1.22	2.6	-0.75	3.2

Reference: Rows, columns, and groups in dplyr

Most dplyr workflows can be organized around three kinds of operations: selecting rows, selecting columns, and operating within groups.

Code

demo_q <- nhanes('DEMO_D')
bmx_q <- nhanes('BMX_D')

Rows: keep or reorder observations

Code

demo_q %>%
  filter(RIDAGEYR > 25, RIAGENDR == "Male") %>%  # filter rows by logical and: over 25 and male 
  arrange(desc(RIDAGEYR)) %>%                # order rows by descending age
  distinct(DMDHREDU, .keep_all = TRUE) %>%   # keep first row per education level for inspection
  select(SEQN, RIDAGEYR, RIAGENDR, DMDHREDU) %>% # subset down to a few variables of interest
  
  # Watch out for attempts to refer to variables you have 'selected out' earlier in the pipeline.
  # Here, DMDHRBRN isn't in the select statement above, so we can't use it in a downstream mutate call
  # mutate(
  #   in_us = if_else(DMDHRBRN == "Born in 50 US States or Washington, DC", TRUE, FALSE)
  # ) %>%
  kable_table()

SEQN	RIDAGEYR	RIAGENDR	DMDHREDU
31447	85	Male	High School Grad/GED or equivalent
31558	85	Male	9-11th Grade (Includes 12th grade with no diploma)
31610	85	Male	Some College or AA degree
32444	85	Male	Less Than 9th Grade
32455	85	Male	College Graduate or above
35818	85	Male	NA
39906	78	Male	Refused
31634	67	Male	Don't know

Row filters can also use if_any() and if_all() across multiple columns:

Here, using if_any, we find observations where weight (BMXWT in kg), arm length (BMXARML in cm), or thigh circumference (BMXTHICR in cm) is greater than 60 (setting aside the different units of measurement).

Code

bmx_q %>%
  filter(if_any(c(BMXWT, BMXARML, BMXTHICR), ~ .x > 60)) %>%  # any column above 60
  slice_sample(n = 5) %>% # get a random sample of 5 rows
  kable_table()

SEQN	BMDSTATS	BMXWT	BMIWT	BMXRECUM	BMIRECUM	BMXHEAD	BMIHEAD	BMXHT	BMIHT	BMXBMI	BMXLEG	BMILEG	BMXCALF	BMICALF	BMXARML	BMIARML	BMXARMC	BMIARMC	BMXWAIST	BMIWAIST	BMXTHICR	BMITHICR	BMXTRI	BMITRI	BMXSUB	BMISUB
33990	Partial: Height and weight obtained	79	Clothing	NA	NA	NA	NA	165	NA	29	NA	Could not obtain	NA	Could not obtain	NA	Could not obtain	NA	Could not obtain	NA	Could not obtain	NA	Could not obtain	NA	Could not obtain	NA	Could not obtain
35208	Complete data for age group	122	NA	NA	NA	NA	NA	191	NA	34	37	NA	46	NA	43	NA	36	NA	132	NA	54	NA	15.0	NA	31	NA
34966	Complete data for age group	72	NA	NA	NA	NA	NA	165	NA	27	36	NA	38	NA	36	NA	34	NA	92	NA	51	NA	8.4	NA	21	NA
35434	Complete data for age group	98	NA	NA	NA	NA	NA	177	NA	31	44	NA	39	NA	41	NA	30	NA	118	NA	55	NA	21.0	NA	23	NA
31945	Complete data for age group	89	NA	NA	NA	NA	NA	175	NA	29	36	NA	39	NA	39	NA	37	NA	117	NA	52	NA	19.2	NA	23	NA

By comparison, here are rows where all three of these are greater than 30, using if_all:

Code

bmx_q %>%
  filter(if_all(c(BMXWT, BMXARML, BMXTHICR), ~ .x > 30)) %>%  # all columns above 30
  slice_sample(n = 5) %>%
  kable_table()

SEQN	BMDSTATS	BMXWT	BMIWT	BMXRECUM	BMIRECUM	BMXHEAD	BMIHEAD	BMXHT	BMIHT	BMXBMI	BMXLEG	BMILEG	BMXCALF	BMICALF	BMXARML	BMIARML	BMXARMC	BMIARMC	BMXWAIST	BMIWAIST	BMXTHICR	BMITHICR	BMXTRI	BMITRI	BMXSUB	BMISUB
39400	Complete data for age group	73	NA	NA	NA	NA	NA	166	NA	27	37	NA	40	NA	39	NA	30	NA	96	NA	50	NA	20	NA	19.4	NA
40159	Other partial exam	74	NA	NA	NA	NA	NA	162	NA	28	36	NA	40	NA	34	NA	33	NA	94	NA	54	NA	33	NA	NA	Could not obtain
38581	Complete data for age group	36	NA	NA	NA	NA	NA	151	NA	16	33	NA	30	NA	31	NA	21	NA	60	NA	38	NA	14	NA	7.2	NA
31668	Complete data for age group	49	NA	NA	NA	NA	NA	153	Not straight	21	37	NA	32	NA	36	NA	23	NA	78	NA	43	NA	10	NA	7.6	NA
40802	Complete data for age group	93	NA	NA	NA	NA	NA	162	NA	36	37	NA	44	NA	38	NA	38	NA	115	NA	59	NA	34	NA	35.8	NA

Note: filter() drops rows where the condition is NA, so include is.na() checks if you want to keep missing values.

Code

bmx_q %>%
  filter(is.na(BMXWT) | BMXWT > 60) %>%  # keep missing weights or weights > 60
  kable_table(n = 6)

SEQN	BMDSTATS	BMXWT	BMIWT	BMXRECUM	BMIRECUM	BMXHEAD	BMIHEAD	BMXHT	BMIHT	BMXBMI	BMXLEG	BMILEG	BMXCALF	BMICALF	BMXARML	BMIARML	BMXARMC	BMIARMC	BMXWAIST	BMIWAIST	BMXTHICR	BMITHICR	BMXTRI	BMITRI	BMXSUB	BMISUB
31129	Complete data for age group	75	NA	NA	NA	NA	NA	168	NA	27	43	NA	41	NA	36	NA	33	NA	98	NA	56	NA	19	NA	18	NA
31130	No body measures exam data	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
31131	Other partial exam	75	NA	NA	NA	NA	NA	156	NA	31	38	NA	37	NA	35	NA	36	NA	96	NA	54	NA	NA	Exceeds capacity	NA	Could not obtain
31132	Complete data for age group	70	NA	NA	NA	NA	NA	168	NA	25	40	NA	36	NA	38	NA	31	NA	96	NA	48	NA	10	NA	22	NA
31134	Complete data for age group	102	NA	NA	NA	NA	NA	182	NA	31	42	NA	43	NA	43	NA	33	NA	117	NA	50	NA	13	NA	16	NA
31137	Other partial exam	80	NA	NA	NA	NA	NA	170	NA	28	45	NA	41	NA	38	NA	32	NA	89	NA	60	NA	30	NA	NA	Could not obtain

You can also take slices directly:

Code

demo_q %>%
  slice_max(RIDAGEYR, n = 3, with_ties = FALSE) %>%  # top 3 rows by age
  kable_table()

SEQN	SDDSRVYR	RIDSTATR	RIDEXMON	RIAGENDR	RIDAGEYR	RIDAGEMN	RIDAGEEX	RIDRETH1	DMQMILIT	DMDBORN	DMDCITZN	DMDYRSUS	DMDEDUC3	DMDEDUC2	DMDSCHOL	DMDMARTL	DMDHHSIZ	DMDFMSIZ	INDHHINC	INDFMINC	INDFMPIR	RIDEXPRG	DMDHRGND	DMDHRAGE	DMDHRBRN	DMDHREDU	DMDHRMAR	DMDHSEDU	SIALANG	SIAPROXY	SIAINTRP	FIALANG	FIAPROXY	FIAINTRP	MIALANG	MIAPROXY	MIAINTRP	AIALANG	WTINT2YR	WTMEC2YR	SDMVPSU	SDMVSTRA
31130	NHANES 2005-2006 Public Release	Both Interviewed and MEC examined	May 1 through October 31	Female	85	NA	NA	Non-Hispanic White	No	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	Some College or AA degree	NA	Widowed	1	1	$15,000 to $19,999	$15,000 to $19,999	1.99	NA	Female	85	Born in 50 US States or Washington, DC	Some College or AA degree	Widowed	NA	English	No	No	English	No	No	NA	NA	NA	NA	29961	34031	2	46
31149	NHANES 2005-2006 Public Release	Both Interviewed and MEC examined	May 1 through October 31	Female	85	NA	NA	Non-Hispanic White	No	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	9-11th Grade (Includes 12th grade with no diploma)	NA	Widowed	1	1	$ 0 to $ 4,999	$ 0 to $ 4,999	0.05	NA	Female	85	Born in 50 US States or Washington, DC	9-11th Grade (Includes 12th grade with no diploma)	Widowed	NA	English	No	No	English	No	No	English	No	No	English	23813	25998	2	52
31297	NHANES 2005-2006 Public Release	Both Interviewed and MEC examined	May 1 through October 31	Female	85	NA	NA	Non-Hispanic White	No	Born Elsewhere	Not a citizen of the US	50 years or more	NA	High School Grad/GED or Equivalent	NA	Widowed	2	2	$75,000 and Over	$75,000 and Over	5.00	NA	Male	47	Born in 50 US States or Washington, DC	High School Grad/GED or equivalent	Divorced	NA	English	No	No	English	No	No	English	No	No	English	33863	36880	1	46

Or take the first few rows in a group with slice_head():

Code

demo_q %>%
  group_by(RIAGENDR) %>%
  slice_head(n = 2) %>%
  kable_table()

SEQN	SDDSRVYR	RIDSTATR	RIDEXMON	RIAGENDR	RIDAGEYR	RIDAGEMN	RIDAGEEX	RIDRETH1	DMQMILIT	DMDBORN	DMDCITZN	DMDYRSUS	DMDEDUC3	DMDEDUC2	DMDSCHOL	DMDMARTL	DMDHHSIZ	DMDFMSIZ	INDHHINC	INDFMINC	INDFMPIR	RIDEXPRG	DMDHRGND	DMDHRAGE	DMDHRBRN	DMDHREDU	DMDHRMAR	DMDHSEDU	SIALANG	SIAPROXY	SIAINTRP	FIALANG	FIAPROXY	FIAINTRP	MIALANG	MIAPROXY	MIAINTRP	AIALANG	WTINT2YR	WTMEC2YR	SDMVPSU	SDMVSTRA
31127	NHANES 2005-2006 Public Release	Both Interviewed and MEC examined	May 1 through October 31	Male	0	11	12	Non-Hispanic White	NA	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	NA	NA	NA	4	4	$15,000 to $19,999	$15,000 to $19,999	0.75	NA	Female	21	Born in 50 US States or Washington, DC	High School Grad/GED or equivalent	Married	9-11th Grade (Includes 12th grade with no diploma)	English	Yes	No	English	No	No	NA	NA	NA	NA	6435	6571	2	44
31129	NHANES 2005-2006 Public Release	Both Interviewed and MEC examined	May 1 through October 31	Male	15	189	190	Non-Hispanic Black	NA	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	10th Grade	NA	In school	Never married	6	6	$65,000 to $74,999	$65,000 to $74,999	2.71	NA	Male	41	Born in 50 US States or Washington, DC	Some College or AA degree	Married	Some College or AA degree	English	Yes	No	English	No	No	English	No	No	English	5317	5587	1	51
31128	NHANES 2005-2006 Public Release	Both Interviewed and MEC examined	November 1 through April 30	Female	11	132	132	Non-Hispanic Black	NA	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	4th Grade	NA	In school	NA	7	6	$45,000 to $54,999	$20,000 to $24,999	0.77	SP not pregnant at exam	Male	47	Born in 50 US States or Washington, DC	9-11th Grade (Includes 12th grade with no diploma)	NA	NA	English	Yes	No	English	No	No	English	No	No	English	9082	8987	1	52
31130	NHANES 2005-2006 Public Release	Both Interviewed and MEC examined	May 1 through October 31	Female	85	NA	NA	Non-Hispanic White	No	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	Some College or AA degree	NA	Widowed	1	1	$15,000 to $19,999	$15,000 to $19,999	1.99	NA	Female	85	Born in 50 US States or Washington, DC	Some College or AA degree	Widowed	NA	English	No	No	English	No	No	NA	NA	NA	NA	29961	34031	2	46

Columns: choose or reshape variables

Code

demo_q %>%
  select(SEQN, RIAGENDR, RIDAGEYR) %>%    # keep columns
  rename(gender = RIAGENDR) %>%           # rename columns
  relocate(RIDAGEYR, .before = gender) %>%    # move columns
  kable_table(n = 6)

SEQN	RIDAGEYR	gender
31127	0	Male
31128	11	Female
31129	15	Male
31130	85	Female
31131	44	Female
31132	70	Male

Tidyselect patterns: select columns by name or type

Code

demo_q %>%
  select(SEQN, starts_with("RID"), ends_with("YR")) %>%
  kable_table(n = 6)

SEQN	RIDSTATR	RIDEXMON	RIDAGEYR	RIDAGEMN	RIDAGEEX	RIDRETH1	RIDEXPRG	SDDSRVYR	WTINT2YR	WTMEC2YR
31127	Both Interviewed and MEC examined	May 1 through October 31	0	11	12	Non-Hispanic White	NA	NHANES 2005-2006 Public Release	6435	6571
31128	Both Interviewed and MEC examined	November 1 through April 30	11	132	132	Non-Hispanic Black	SP not pregnant at exam	NHANES 2005-2006 Public Release	9082	8987
31129	Both Interviewed and MEC examined	May 1 through October 31	15	189	190	Non-Hispanic Black	NA	NHANES 2005-2006 Public Release	5317	5587
31130	Both Interviewed and MEC examined	May 1 through October 31	85	NA	NA	Non-Hispanic White	NA	NHANES 2005-2006 Public Release	29961	34031
31131	Both Interviewed and MEC examined	May 1 through October 31	44	535	536	Non-Hispanic Black	SP not pregnant at exam	NHANES 2005-2006 Public Release	26458	26771
31132	Both Interviewed and MEC examined	May 1 through October 31	70	842	843	Non-Hispanic White	NA	NHANES 2005-2006 Public Release	32962	35316

Code

bmx_q %>%
  select(where(is.numeric), matches("^BMX")) %>%
  kable_table(n = 6)

SEQN	BMXWT	BMXRECUM	BMXHEAD	BMIHEAD	BMXHT	BMXBMI	BMXLEG	BMXCALF	BMXARML	BMXARMC	BMXWAIST	BMXTHICR	BMXTRI	BMXSUB
31127	10	74	NA	NA	NA	NA	NA	NA	16	16	NA	NA	13	10.0
31128	40	NA	NA	NA	152	17	38	29	34	22	63	40	10	8.4
31129	75	NA	NA	NA	168	27	43	41	36	33	98	56	19	17.6
31130	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
31131	75	NA	NA	NA	156	31	38	37	35	36	96	54	NA	NA
31132	70	NA	NA	NA	168	25	40	36	38	31	96	48	10	22.2

You can also drop columns with a leading - inside select():

Code

demo_q %>%
  select(-starts_with("RID")) %>%
  kable_table(n = 6)

SEQN	SDDSRVYR	RIAGENDR	DMQMILIT	DMDBORN	DMDCITZN	DMDYRSUS	DMDEDUC3	DMDEDUC2	DMDSCHOL	DMDMARTL	DMDHHSIZ	DMDFMSIZ	INDHHINC	INDFMINC	INDFMPIR	DMDHRGND	DMDHRAGE	DMDHRBRN	DMDHREDU	DMDHRMAR	DMDHSEDU	SIALANG	SIAPROXY	SIAINTRP	FIALANG	FIAPROXY	FIAINTRP	MIALANG	MIAPROXY	MIAINTRP	AIALANG	WTINT2YR	WTMEC2YR	SDMVPSU	SDMVSTRA
31127	NHANES 2005-2006 Public Release	Male	NA	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	NA	NA	NA	4	4	$15,000 to $19,999	$15,000 to $19,999	0.75	Female	21	Born in 50 US States or Washington, DC	High School Grad/GED or equivalent	Married	9-11th Grade (Includes 12th grade with no diploma)	English	Yes	No	English	No	No	NA	NA	NA	NA	6435	6571	2	44
31128	NHANES 2005-2006 Public Release	Female	NA	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	4th Grade	NA	In school	NA	7	6	$45,000 to $54,999	$20,000 to $24,999	0.77	Male	47	Born in 50 US States or Washington, DC	9-11th Grade (Includes 12th grade with no diploma)	NA	NA	English	Yes	No	English	No	No	English	No	No	English	9082	8987	1	52
31129	NHANES 2005-2006 Public Release	Male	NA	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	10th Grade	NA	In school	Never married	6	6	$65,000 to $74,999	$65,000 to $74,999	2.71	Male	41	Born in 50 US States or Washington, DC	Some College or AA degree	Married	Some College or AA degree	English	Yes	No	English	No	No	English	No	No	English	5317	5587	1	51
31130	NHANES 2005-2006 Public Release	Female	No	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	Some College or AA degree	NA	Widowed	1	1	$15,000 to $19,999	$15,000 to $19,999	1.99	Female	85	Born in 50 US States or Washington, DC	Some College or AA degree	Widowed	NA	English	No	No	English	No	No	NA	NA	NA	NA	29961	34031	2	46
31131	NHANES 2005-2006 Public Release	Female	No	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	Some College or AA degree	NA	Married	4	4	$75,000 and Over	$75,000 and Over	4.65	Male	36	Born in 50 US States or Washington, DC	College Graduate or above	Married	Some College or AA degree	English	No	No	English	No	No	English	No	No	English	26458	26771	1	48
31132	NHANES 2005-2006 Public Release	Male	Yes	Born in 50 US States or Washington, DC	Citizen by birth or naturalization	NA	NA	College Graduate or above	NA	Married	2	2	$75,000 and Over	$75,000 and Over	5.00	Male	70	Born in 50 US States or Washington, DC	College Graduate or above	Married	College Graduate or above	English	No	No	English	No	No	English	No	No	English	32962	35316	2	52

Conditional transforms: create variables with if_else() and case_when()

Code

demo_q %>%
  mutate(
    adult = if_else(RIDAGEYR >= 18, "adult", "minor"),
    age_group = case_when(
      RIDAGEYR < 18 ~ "child",
      RIDAGEYR < 65 ~ "adult",
      TRUE ~ "older adult"
    )
  ) %>%
  select(SEQN, RIDAGEYR, age_group) %>%
  kable_table(n = 6)

SEQN	RIDAGEYR	age_group
31127	0	child
31128	11	child
31129	15	child
31130	85	older adult
31131	44	adult
31132	70	older adult

Mutate across: apply the same transformation to multiple columns

Code

bmx_q %>%
  mutate(
    across(
      c(BMXWT, BMXARML, BMXTHICR),
      ~ .x - mean(.x, na.rm = TRUE),
      .names = "centered_{.col}"
    )
  ) %>%
  dplyr::select(SEQN, starts_with("centered_")) %>%
  kable_table(n = 6)

SEQN	centered_BMXWT	centered_BMXARML	centered_BMXTHICR
31127	-49.8	-16.7	NA
31128	-19.9	1.6	-11.8
31129	14.6	3.8	4.6
31130	NA	NA	NA
31131	15.2	2.3	2.4
31132	9.5	4.8	-3.3

Groups: summarize or mutate within groups

group_by() can use multiple keys (e.g., group_by(BTN, COMPANY)) and ungroup() drops grouping when you’re done.

Code

demo_q %>%
  group_by(RIAGENDR) %>%
  summarize(
    n = n(),
    avg_age = mean(RIDAGEYR, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  kable_table()

RIAGENDR	n	avg_age
Male	5080	28
Female	5268	28

Grouping persists across verbs until you ungroup() (or use .groups in summarize()):

Code

demo_q %>%
  group_by(RIAGENDR) %>%
  mutate(mean_age = mean(RIDAGEYR, na.rm = TRUE)) %>%
  ungroup() %>%
  summarize(overall_mean = mean(RIDAGEYR, na.rm = TRUE)) %>%
  kable_table()

overall_mean
28

Quick counting helpers:

Code

demo_q %>%
  count(RIAGENDR)                      # counts per gender

RIAGENDR	n
Male	5080
Female	5268

Code

demo_q %>%
  summarize(n_edu = n_distinct(DMDHREDU))

n_edu
8

For single summaries, you can also use .by instead of group_by():

Code

demo_q %>%
  summarize(
    avg_age = mean(RIDAGEYR, na.rm = TRUE),
    .by = RIAGENDR # compute summary within gender without keeping groups
  )

RIAGENDR	avg_age
Male	28
Female	28