The goal of this document is to provide a basic introduction to data wrangling using functions from the so-called ‘tidyverse’ approach. The tidyverse (https://www.tidyverse.org) is a set of data science packages in R that are intended to provide a consistent paradigm for working with data. This approach unifies a previously inchoate landscape of different functions and packages in R that could be daunting to new users.
Although I do not claim that the tidyverse approach is best according to all possible criteria, I believe that it is the best paradigm for working with data in R for social scientists, many of whom do not have a formal background in computer programming.
Here, I will draw primarily from the tidyr and dplyr packages in R.
For an excellent book-length treatment of the tidyverse approach, see R for Data Science (2nd edition) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
Before we start: beware namespace collisions!
One of the most irritating problems you may encounter in the tidyverse world (and more generally, in R) is when code that previously worked suddenly throws an inexplicable error.
For example:
> survey %>% group_by(R_exp) %>%
summarize(m_age=mean(Psych_age_yrs), sd_age=sd(Psych_age_yrs))
Error in summarize(., m_age = mean(Psych_age_yrs), sd_age = sd(Psych_age_yrs)) :
argument "by" is missing, with no default
Because dplyr (and sometimes tidyr) uses fairly intuitive data wrangling verbs such as ‘summarize’ and ‘select’, it sometimes shares function names with other packages. For example, Hmisc has a summarize function that does not operate the same way as summarize in dplyr. Also, the predecessor to dplyr was called plyr. Although largely outmoded, plyr has a few remaining functions that may be useful. But many of those functions have the same names as dplyr functions while operating differently (the syntax is not the same!), which is a common source of collisions when using dplyr.
This points to the problem of what are called ‘namespace collisions.’ That is, when R looks for a function (or any object) in the global environment, it searches through a ‘path’. You can see the nitty gritty using searchpaths(). But the TL;DR is that if you – or any function you call on – loads another package, that package may override a dplyr function and make your code crash!
Addressing namespace collisions
Watch out for warnings about objects being ‘masked’ when packages are loaded.
Explicitly specify the package where your desired function lives using the double colon operator. Example: dplyr::summarize.
Try to load tidyverse packages using library(tidyverse). This handles collisions within the tidyverse!
For example, in this document, multilevel pulls in MASS, which masks dplyr::select. If we were to load dplyr first, then multilevel, we would see the following:
> library(dplyr)
> library(multilevel)
Loading required package: nlme
Attaching package: ‘nlme’
The following object is masked from ‘package:dplyr’:
collapse
Loading required package: MASS
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
Read in English, this says, “The multilevel package needs to load nlme (a mixed-effects package). Nlme has a function called ‘collapse’ that will now take precedence over ‘collapse’ in dplyr if you run collapse() in your session. Multilevel also needs the MASS package, so I’m loading it. The MASS package has a function called ‘select’ that now takes precedence over dplyr’s select function, so if you call select() in your session, you will be getting the MASS package function, not the dplyr one.”
To get around these problems, the easiest route is to load dplyr (and other tidyverse packages) last in the chain, giving them precedence over other packages. As you see above in the p_load call, we load dplyr last in the chain so that its select function is preferred over MASS's. This is important because functions of the same name across packages may have completely different purposes and syntax!
The slightly harder, but more robust, route is always to use explicit namespace qualifiers with the :: operator. If you use dplyr::select as the function call, you are explicitly telling R that you want dplyr’s select function, not some other package’s.
The two most common collisions are with select (MASS) and summarize (Hmisc). If you’re not sure which version of a function has taken precedence you can type the function name without parentheses. This will print the function source (which you can ignore). But take a look at the last line, which reveals where R ‘found’ the function in the search across loaded packages (based on their load order/precedence).
> select
function (obj)
UseMethod("select")
<bytecode: 0x93b18ebd8>
<environment: namespace:MASS>
If you were expecting that last line to read <environment: namespace:dplyr> (i.e. dplyr’s version), then you have a namespace problem to resolve!
Data pipelines
The tidyverse paradigm encourages the use of so-called data pipelines when writing the syntax for a multi-step data transformation procedure. The pipe operator %>% is provided by the magrittr package, which is loaded by dplyr. Data pipeline syntax is intended to provide a readable syntax for the order in which data operations are performed. You can think of this as a recipe for data wrangling. For example:
1. Read data from the file: 'mydata.csv'
2. Rename SDATE to submission_date; rename ed to education
3. Only keep subject IDs above 10 (the first 9 were in-lab pilot subjects)
4. Drop the timestamp and survey_settings variables
5. Compute the log of income as a new variable called log_income
6. Arrange the data by ID, then submission_date (where each subject submitted many surveys)
7. Ensure that ID and submission_date are the left-most columns in the data
We would write this out as a dplyr pipeline using the pipe operator %>% to chain together data operations.
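The recipe above might look like the pipeline below. This is a sketch: mydata.csv and the column names (SDATE, ed, id, timestamp, survey_settings, income) come straight from the recipe, so we fabricate a tiny stand-in dataset with those columns to make the chain runnable.

```r
library(dplyr)

# step 1 in real use would be: mydata <- readr::read_csv("mydata.csv")
# here we fabricate a tiny stand-in with the recipe's columns
mydata <- tibble(
  id = c(5, 11, 12, 12, 11),
  SDATE = as.Date(c("2020-01-05", "2020-01-03", "2020-01-01", "2020-01-02", "2020-01-01")),
  ed = c(12, 16, 14, 14, 16),
  timestamp = 1:5,
  survey_settings = "default",
  income = c(30000, 52000, 41000, 41000, 52000)
)

wrangled <- mydata %>%
  rename(submission_date = SDATE, education = ed) %>%  # step 2
  filter(id > 10) %>%                                  # step 3: drop in-lab pilots
  select(-timestamp, -survey_settings) %>%             # step 4
  mutate(log_income = log(income)) %>%                 # step 5
  arrange(id, submission_date) %>%                     # step 6
  relocate(id, submission_date)                        # step 7: left-most columns
```

Each verb maps onto one recipe step, which is what makes the pipeline read like the recipe itself.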
Sometimes it is useful to refer to the current dataset or variable explicitly in tidyverse data wrangling syntax.
dplyr/magrittr tends to hide this from us for convenience, but it’s there under the hood.
iris %>% filter(Sepal.Length > 7)
is the same as
iris %>% filter(., Sepal.Length > 7)
So, '.' refers to the current dataset or variable (depending on context) in dplyr operations. And if you don’t specify where the '.' falls in your syntax, it will always be passed as the first argument to the downstream function.
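To see where the '.' goes, here is a small sketch using the built-in mtcars data: because lm() does not take the data as its first argument, we place the '.' explicitly.

```r
library(dplyr)

# Without an explicit '.', the piped dataset would be passed as lm()'s first
# argument (where the formula belongs), which would fail. Placing '.' in the
# data argument tells the pipe exactly where the dataset goes.
fit <- mtcars %>% lm(mpg ~ wt, data = .)
```

This is equivalent to lm(mpg ~ wt, data = mtcars).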
Special values to watch out for in data wrangling
NA: missing
na.rm=TRUE available in many functions
Also, see na.omit(), na.exclude(), na.fail(), na.pass()
NULL: null set
Often used when something is undefined
Inf: infinite
NaN: Not a number.
Result of an invalid computation, e.g., log(-1)
warnings(): If R mentions a warning in data wrangling, make sure you’ve handled it or know its origin. Don’t just ignore these!
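A few quick illustrations of these special values, using base R only:

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA: a missing value propagates through the computation
mean(x, na.rm = TRUE)  # drop NAs first; many summary functions offer na.rm

nan_val <- suppressWarnings(log(-1))  # NaN: invalid computation (also emits a warning)
inf_val <- 1/0                        # Inf: infinite

is.na(x)          # element-wise test for missingness
is.null(NULL)     # TRUE: NULL is the undefined/empty object
```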
dplyr fundamentals
The dplyr package is at the core of data wrangling in the tidyverse, providing a set of wrangling verbs that collectively form a coherent language for data transformations.
Overview: a first pass through an NHANES dataset
To learn dplyr, let’s start with a survey from the National Health and Nutrition Examination Survey (NHANES) dataset. These data are provided in the nhanesA package. We’ll start by looking at a couple of basic demographic variables.
NHANES variables used below (from DEMO_D and BMX_D):
SEQN: respondent ID
RIAGENDR: gender (coded numeric in raw NHANES)
RIDAGEYR: age in years
DMDHREDU: education level
BMXWT: weight (kg)
BMXARML: upper arm length (cm)
BMXTHICR: thigh circumference (cm)
Code
# demographics: note that this code relies on functions in the nhanesA package, which
# is useful for working with NHANES data, but is not a focus of this workshop per se.
demo_d <- nhanes('DEMO_D')
demo_d_vars <- nhanesTableVars('DEMOGRAPHICS', 'DEMO_D', namesonly=TRUE)
# translate numeric codes into factors by using the nhanes lookup/codebook functions
demo_d <- suppressWarnings(nhanesTranslate('DEMO_D', demo_d_vars, data=demo_d))
# dplyr pipeline
demo_d <- demo_d %>%
  filter(!INDHHINC %in% c("Over $20,000", "Under $20,000", "Refused", "Don't know")) %>%
  droplevels() %>%  # drop unused factor levels from the data.frame
  mutate(
    # case_when() is a vectorized if/else ladder for many conditions
    income_num = case_when(  # convert range-based factor labels to midpoint numbers
      INDHHINC == "$ 0 to $ 4,999"     ~ 2500,
      INDHHINC == "$ 5,000 to $ 9,999" ~ 7500,
      INDHHINC == "$10,000 to $14,999" ~ 12500,
      INDHHINC == "$15,000 to $19,999" ~ 17500,
      INDHHINC == "$20,000 to $24,999" ~ 22500,
      INDHHINC == "$25,000 to $34,999" ~ 30000,
      INDHHINC == "$35,000 to $44,999" ~ 40000,
      INDHHINC == "$45,000 to $54,999" ~ 50000,
      INDHHINC == "$55,000 to $64,999" ~ 60000,
      INDHHINC == "$65,000 to $74,999" ~ 70000,
      INDHHINC == "$75,000 and Over"   ~ 80000
    )
  )

# load body (biometric) measures
bmx_d <- nhanes('BMX_D')
bmx_d_vars <- nhanesTableVars('EXAM', 'BMX_D', namesonly=TRUE)
bmx_d <- suppressWarnings(nhanesTranslate('BMX_D', bmx_d_vars, data=bmx_d))
# merge the education demographics variable with the biometric data, joining on the SEQN column (basically an ID)
bmx_d <- demo_d %>%
  dplyr::select(SEQN, DMDHREDU) %>%
  inner_join(bmx_d, by="SEQN")
group_by + summarize
Let’s summarize the mean household income (income_num, converted from the INDHHINC categories) by highest level of education completed (DMDHREDU).
Code
demo_d %>%
  group_by(DMDHREDU) %>%  # divide dataset into separate compartments by education level
  dplyr::summarize(
    n=n(),  # number of observations
    m_income=mean(income_num, na.rm=T),
    sd_income=sd(income_num, na.rm=T)
  ) %>%
  kable_table()
DMDHREDU                                               n  m_income  sd_income
Less Than 9th Grade                                 1173     28024      18489
9-11th Grade (Includes 12th grade with no diploma)  1626     31318      21584
High School Grad/GED or equivalent                  2317     39400      23422
Some College or AA degree                           2664     48137      25326
College Graduate or above                           1768     63785      21816
Refused                                                6     25000      12624
Don't know                                            26     20900      15760
NA                                                   276     44623      24849
Note that summarize removes a single level of grouping in the group_by process. Here, we only have one grouping variable, DMDHREDU, so the output of summarize will be ‘ungrouped.’
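To see this peeling-off behavior with more than one grouping variable, here is a small sketch using the built-in mtcars data rather than NHANES: after summarize, only the last grouping variable is removed.

```r
library(dplyr)

two_level <- mtcars %>%
  group_by(cyl, gear) %>%                               # two grouping variables
  summarize(m_mpg = mean(mpg), .groups = "drop_last")   # "drop_last" is the default

group_vars(two_level)  # "cyl": the result is still grouped by the first variable
```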
Grouped summaries of several variables
What if I want to have means and SDs for several continuous variables grouped by highest education? Let’s look specifically at the weight (BMXWT), upper arm length (BMXARML), and thigh circumference (BMXTHICR) measurements. The combination of summarize and across provides the ability to specify several variables using the .cols argument of across and, potentially, several summary functions by passing them in a named list.
DMDHREDU                                            BMXWT_m  BMXWT_sd  BMXARML_m  BMXARML_sd  BMXTHICR_m  BMXTHICR_sd
9-11th Grade (Includes 12th grade with no diploma)       56        33         32         8.2          51         8.40
High School Grad/GED or equivalent                       61        33         33         7.9          52         8.10
Some College or AA degree                                62        32         33         7.6          52         8.65
College Graduate or above                                61        31         33         7.8          52         7.99
Refused                                                  70        25         38         2.5          50         0.57
Don't know                                               50        36         28        11.5          53         7.74
NA                                                       56        29         32         7.6          51         7.73
Let’s slow this down:
group_by verb
bmx_d %>% group_by(DMDHREDU)
This tells dplyr to divide the bmx_d (NHANES biometric measurements) data into a set of smaller data.frame objects, one per level of DMDHREDU. Internally, this looks something like the output below. After this division of the dataset into chunks, summarize will work on each chunk individually.
$`Less Than 9th Grade`
SEQN DMDHREDU BMDSTATS BMXWT
1 31137 Less Than 9th Grade Other partial exam 80
2 31157 Less Than 9th Grade Complete data for age group 42
3 31169 Less Than 9th Grade Complete data for age group 22
4 31175 Less Than 9th Grade Complete data for age group 86
$`9-11th Grade (Includes 12th grade with no diploma)`
SEQN DMDHREDU
1 31128 9-11th Grade (Includes 12th grade with no diploma)
2 31133 9-11th Grade (Includes 12th grade with no diploma)
3 31145 9-11th Grade (Includes 12th grade with no diploma)
4 31148 9-11th Grade (Includes 12th grade with no diploma)
BMDSTATS BMXWT
1 Complete data for age group 40
2 Complete data for age group 45
3 Complete data for age group 40
4 Complete data for age group 52
$`High School Grad/GED or equivalent`
SEQN DMDHREDU BMDSTATS BMXWT
1 31127 High School Grad/GED or equivalent Complete data for age group 10
2 31139 High School Grad/GED or equivalent Other partial exam 74
3 31140 High School Grad/GED or equivalent Complete data for age group 42
4 31143 High School Grad/GED or equivalent Complete data for age group 76
$`Some College or AA degree`
SEQN DMDHREDU BMDSTATS BMXWT
1 31129 Some College or AA degree Complete data for age group 75
2 31130 Some College or AA degree No body measures exam data NA
3 31138 Some College or AA degree Complete data for age group 14
4 31142 Some College or AA degree Other partial exam 80
$`College Graduate or above`
SEQN DMDHREDU BMDSTATS BMXWT
1 31131 College Graduate or above Other partial exam 75
2 31132 College Graduate or above Complete data for age group 70
3 31135 College Graduate or above Complete data for age group 10
4 31141 College Graduate or above Complete data for age group 60
$Refused
SEQN DMDHREDU BMDSTATS BMXWT
1 33385 Refused Complete data for age group 78
2 35296 Refused Partial: Height and weight obtained 38
3 37761 Refused Complete data for age group 67
4 40762 Refused Partial: Height and weight obtained 99
$`Don't know`
SEQN DMDHREDU BMDSTATS BMXWT
1 31546 Don't know Other partial exam 78
2 32106 Don't know Complete data for age group 12
3 32626 Don't know Complete data for age group 64
4 32648 Don't know Complete data for age group 85
summarize + across
The summarize function transforms a dataset that has many rows to a dataset that has a single row per grouping unit. If you do not use group_by, summarize will yield an overall summary statistic in the entire dataset. For example, to get the mean and SD of household income in NHANES, irrespective of education level or other categorical moderators, we could just use a simple summarize:
But because we used group_by(DMDHREDU) above, we got unique summaries of the variables at each level of education.
The across function accepts two primary arguments. First, we specify a set of variables (the .cols argument) that we wish to summarize in the same way (i.e., compute the same summary statistics). Second, we specify which statistics we wish to compute (the .fns argument). In our case, the syntax was:
The c() function specifies a vector of unquoted variable names in the dataset we wish to summarize, separated by commas. Note that any tidyselect operator for selecting columns will work. This includes starts_with, ends_with, matches, and others. For details, see ?dplyr_tidy_select.
The list() here asks dplyr to run each function in the list against each variable in the .cols specification. Here, this means that dplyr will compute the mean and SD of each variable in the .cols argument at each level of education completed (the group_by basis). The names of the list elements (left side) — here, m and sd — become the suffixes added for each variable. The values of the elements (right side) — here, mean and sd — are the functions that should be used to compute a summary statistic (they should return one number per grouped variable).
Note that the column names that result from an across operation can be modified using the .names argument. The default is to suffix the variable name with the name of each function in the list, preceded by an underscore (e.g., BMXWT_m).
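As a sketch of the .names argument (using the built-in mtcars data rather than NHANES), the glue-style tokens {.col} and {.fn} stand in for the variable name and the function name, respectively:

```r
library(dplyr)

nm <- mtcars %>%
  summarize(across(c(mpg, wt),
                   list(m = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE)),
                   .names = "{.fn}_of_{.col}")) %>%  # prefix rather than the default suffix
  names()

nm  # "m_of_mpg" "sd_of_mpg" "m_of_wt" "sd_of_wt"
```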
Passing arguments to summary functions
Notice how we passed na.rm=TRUE to the mean function within the list. This tells the mean to ignore missing (NA) values when computing the mean (i.e., mean of the non-missing numbers). In general, the dplyr syntax using what they call “lambdas” (starting with ~) is the clearest way to control the arguments passed to each function. Here is a simple definition of a lambda in R:
~ mean(.x, na.rm=TRUE)
The .x refers to the current variable being used within a dplyr data wrangling operation. This is in contrast to ., which generally refers to the current dataset.
If you don’t need to pass arguments to the functions in an across() operation, you can just state the function name:
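For instance (a small sketch on the built-in mtcars data), passing the bare function keeps the original column names:

```r
library(dplyr)

# no extra arguments needed, so no lambda is required
out <- mtcars %>% summarize(across(c(mpg, wt), mean))
out  # one summary row, with columns still named mpg and wt
```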
We can also make the output more beautiful using tidying techniques we’ve already seen in the tidyr tutorial. Remember that R is all about programming for data science. In particular, notice that we have some columns that are means and others that are SDs.
We can just extend our data pipeline a bit. The extract function from tidyr here is like separate, but with a bit more oomph using regular expressions. This is a more intermediate topic, but there is a useful tutorial here: http://www.regular-expressions.info/tutorial.html.
Code
bmx_d %>%
  group_by(DMDHREDU) %>%
  dplyr::summarize(across(c(BMXWT, BMXARML, BMXTHICR),
                          list(m=~mean(.x, na.rm=T), sd=~sd(.x, na.rm=T)))) %>%
  # combine m and sd statistics (notice how you can add comments inline within a pipeline?)
  pivot_longer(cols=-DMDHREDU, names_to="Measure", values_to="value") %>%
  # divide income_m into income and m
  #extract(col=Measure, into=c("bio_measure", "statistic"), regex=("(.*)_(.*)$")) %>%
  separate(col=Measure, into=c("bio_measure", "statistic"), sep="_") %>%
  pivot_wider(names_from=statistic, values_from=value) %>%
  arrange(bio_measure, DMDHREDU) %>%
  kable_table()
DMDHREDU                                            bio_measure   m     sd
Less Than 9th Grade                                 BMXARML      32   7.77
9-11th Grade (Includes 12th grade with no diploma)  BMXARML      32   8.22
High School Grad/GED or equivalent                  BMXARML      33   7.86
Some College or AA degree                           BMXARML      33   7.61
College Graduate or above                           BMXARML      33   7.81
Refused                                             BMXARML      38   2.47
Don't know                                          BMXARML      28  11.49
NA                                                  BMXARML      32   7.62
Less Than 9th Grade                                 BMXTHICR     50   7.48
9-11th Grade (Includes 12th grade with no diploma)  BMXTHICR     51   8.40
High School Grad/GED or equivalent                  BMXTHICR     52   8.10
Some College or AA degree                           BMXTHICR     52   8.65
College Graduate or above                           BMXTHICR     52   7.99
Refused                                             BMXTHICR     50   0.57
Don't know                                          BMXTHICR     53   7.74
NA                                                  BMXTHICR     51   7.73
Less Than 9th Grade                                 BMXWT        58  30.15
9-11th Grade (Includes 12th grade with no diploma)  BMXWT        56  33.44
High School Grad/GED or equivalent                  BMXWT        61  32.70
Some College or AA degree                           BMXWT        62  32.48
College Graduate or above                           BMXWT        61  31.30
Refused                                             BMXWT        70  25.08
Don't know                                          BMXWT        50  36.00
NA                                                  BMXWT        56  29.12
arrange: order observations
Toward the end of the pipeline above, we see:
arrange (bio_measure, DMDHREDU) %>%
The arrange verb in dplyr requests that observations be sorted according to one or more variables. Here, we ask for the dataset to be sorted by bio_measure (biometric measure, such as weight) first, then by education level within that measure. The arrange verb sorts observations in ascending order (low to high) by default, but data can be sorted in descending order using the desc() function:
arrange(bio_measure, desc(DMDHREDU)) %>%
This would sort by highest to lowest education level within each measure (bio_measure), with the bio_measures still in ascending (alphabetical) order.
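A compact sketch of both directions, using the built-in mtcars data rather than the NHANES pipeline:

```r
library(dplyr)

sorted <- mtcars %>%
  arrange(cyl, desc(mpg)) %>%  # ascending cyl; descending mpg within each cyl
  select(cyl, mpg)

head(sorted, 3)  # the most fuel-efficient 4-cylinder cars come first
```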
Examining a more complex multilevel dataset using dplyr
Let’s examine the univbct data, which contains longitudinal observations of job satisfaction, commitment, and readiness to deploy. From the documentation ?univbct:
This data set contains the complete data set used in Bliese and Ployhart (2002). The data is longitudinal data converted to univariate (i.e., stacked) form. Data were collected at three time points. A data frame with 22 columns and 1485 observations from 495 individuals.
Let’s enact the core ‘verbs’ of dplyr to understand and improve the structure of these data.
filter: obtaining observations (rows) based on some criteria
Objective: Retain only men in company A
Code
COMPANY <- "A"
company_A_men <- filter(univbct, COMPANY=="A" & GENDER==1)
#print 10 observations at random to check the accuracy of the filter
#p=11 just shows the first 11 columns to keep it on one page for formatting
company_A_men %>% sample_n(10) %>% kable_table(p=11)
        BTN  COMPANY  MARITAL  GENDER  HOWLONG  RANK  EDUCATE  AGE  JOBSAT1  COMMIT1  READY1
232    1022        A        2       1        2    15        2   29      3.7      3.3     3.2
1373    404        A        2       1        1    16        4   36      4.0      4.0     3.2
1395   3066        A        2       1        2    15        2   26      4.0      4.0     3.8
1235   3066        A        2       1        2    16        2   34      3.0      3.7     3.0
1420   3066        A        2       1        3    13        3   21      3.0      3.7     2.8
1208      4        A        2       1        3    15        2   28      2.3      2.7     1.8
233    1022        A        2       1        2    15        2   29      3.7      3.3     3.2
1159    299        A        2       1        3    14        2   24      2.7      3.0     2.5
445    1022        A        1       1        1    14        2   23      3.7      3.3     3.5
616    4042        A        1       1        2    13        2   19      1.3      3.3     2.5
Objective: Count how many people are in companies A and B
Code
filter(univbct, COMPANY %in%c("A","B")) %>%nrow()
[1] 750
Objective: What about counts by company and battalion?
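One way to do this is count(), which is shorthand for group_by() + summarize(n = n()). Since univbct is not loaded in this snippet, here is a sketch on a small stand-in with the same COMPANY and BTN columns:

```r
library(dplyr)

# hypothetical stand-in for the two univbct columns we need
demo <- tibble(
  COMPANY = c("A", "A", "B", "B", "B", "C"),
  BTN     = c(1022, 1022, 4042, 4042, 404, 4042)
)

demo %>% count(COMPANY, BTN)
# with the real data: univbct %>% count(COMPANY, BTN)
```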
If you have many variables of a similar name, you might try starts_with(). Note in this case that it brings in “READY”, too. Note that you can mix different selection mechanisms within select. Look at the cheatsheet.
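As a quick sketch of mixing selection helpers (shown on the built-in iris data rather than univbct):

```r
library(dplyr)

# a tidyselect helper (starts_with) mixed with a bare column name
iris %>%
  select(starts_with("Petal"), Species) %>%
  names()
# "Petal.Length" "Petal.Width"  "Species"
```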
mutate can manipulate several variables in one call. Here, we mean-center any variable that starts with COMMIT and add the suffix _cm for clarity. We also compute the percentile rank for each of these columns, with _pct as the suffix. Note the use of the starts_with function here within across(). This operates identically to its use in select, but in the context of a summary or mutation operation on specific variables. See ?select_helpers for details.
Code
meancent <- function(x) { x - mean(x, na.rm=TRUE) } #simple worker function to mean center a variable
univbct <- univbct %>%
  mutate(across(starts_with("COMMIT", ignore.case=FALSE), list(cm=meancent, pct=percent_rank)))
univbct %>%
  dplyr::select(starts_with("COMMIT", ignore.case=FALSE)) %>%
  kable_table(n=8) %>%
  kable_styling(font_size=12)
COMMIT1  COMMIT2  COMMIT3  COMMIT  COMMIT1_cm  COMMIT1_pct  COMMIT2_cm  COMMIT2_pct  COMMIT3_cm  COMMIT3_pct  COMMIT_cm  COMMIT_pct
    1.7      1.7      3.0     1.7       -1.95         0.01       -1.80         0.04       -0.54         0.12      -1.87        0.02
    1.7      1.7      3.0     1.7       -1.95         0.01       -1.80         0.04       -0.54         0.12      -1.87        0.02
    1.7      1.7      3.0     3.0       -1.95         0.01       -1.80         0.04       -0.54         0.12      -0.54        0.15
    1.7      1.3      1.3     1.7       -1.95         0.01       -2.13         0.03       -2.20         0.01      -1.87        0.02
    1.7      1.3      1.3     1.3       -1.95         0.01       -2.13         0.03       -2.20         0.01      -2.21        0.01
    1.7      1.3      1.3     1.3       -1.95         0.01       -2.13         0.03       -2.20         0.01      -2.21        0.01
    3.3      3.3      3.7     3.3       -0.28         0.27       -0.13         0.33        0.13         0.45      -0.21        0.30
    3.3      3.3      3.7     3.3       -0.28         0.27       -0.13         0.33        0.13         0.45      -0.21        0.30
arrange: reorder observations in specific order
Order data by ascending battalion, company, then subnum
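The call itself is one line of arrange(); here it is sketched on a small stand-in, since univbct is not loaded in this snippet:

```r
library(dplyr)

# with the real data: univbct %>% arrange(BTN, COMPANY, SUBNUM)
demo <- tibble(
  BTN     = c(4042, 1022, 1022),
  COMPANY = c("A", "B", "A"),
  SUBNUM  = c(3, 2, 1)
)

demo %>% arrange(BTN, COMPANY, SUBNUM)
```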
A more realistic example: preparation for multilevel analysis
In MLM, one strategy for disentangling within- versus between-person effects is to include both within-person-centered variables and person means in the model (Curran & Bauer, 2011).
We can achieve this easily for our three DVs here using a single pipeline that combines tidying and mutation. Using -1 as the sep argument to separate splits each string one character from the end (negative positions count from the right), so 'JOBSAT1' becomes 'JOBSAT' and '1'.
For reshaping to work smoothly, we need a unique identifier for each row. Also, univbct is stored in a dangerously untidy format in which variables with suffix 1-3 indicate a ‘wide format’, but the data is also in long format under variables such as ‘JSAT’ and ‘COMMIT.’ In other words, there is a peculiar redundancy in the data that is altogether confusing.
We first need to eliminate this insanity. Group by subject number and retain only the first row (i.e., keep the wide version).
Code
univbct <- univbct %>%
  group_by(SUBNUM) %>%  # split into separate groups for each subject
  filter(row_number() == 1) %>%  # only retain the first row of each subject
  dplyr::select(-JSAT, -COMMIT, -READY) %>%  # drop redundant columns
  ungroup()  # remove grouping from data structure (we are done with group-based wrangling)
First, let’s get the data into a conventional format (long) for MLM (e.g., using lmer)
Code
forMLM <- univbct %>%
  dplyr::select(SUBNUM, JOBSAT1, JOBSAT2, JOBSAT3, COMMIT1, COMMIT2, COMMIT3, READY1, READY2, READY3) %>%
  # pivot everything but SUBNUM
  pivot_longer(names_to="key", values_to="value", cols=-SUBNUM) %>%
  # -1 splits at the last character of the variable name
  separate(col="key", into=c("variable", "occasion"), -1, convert=TRUE) %>%
  pivot_wider(names_from=variable, values_from=value) #%>%
  #mutate(occasion=as.integer(occasion))
Now, let’s perform the centering described above. You could do this in one pipeline – I just separated things here for conceptual clarity.
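A sketch of the centering step, shown on a fabricated miniature of the long-format data (SUBNUM, occasion, JOBSAT) so it runs standalone; the same mutate would apply to forMLM and the other DVs:

```r
library(dplyr)

# fabricated stand-in with the same shape as forMLM
forMLM_demo <- tibble(
  SUBNUM   = rep(1:2, each = 3),
  occasion = rep(1:3, times = 2),
  JOBSAT   = c(3, 4, 5, 2, 2, 2)
)

forMLM_demo <- forMLM_demo %>%
  group_by(SUBNUM) %>%
  mutate(
    JOBSAT_pm = mean(JOBSAT, na.rm = TRUE),  # person mean (between-person term)
    JOBSAT_wc = JOBSAT - JOBSAT_pm           # within-person centered term
  ) %>%
  ungroup()
```

Both terms can then enter the MLM as separate predictors, following the Curran and Bauer (2011) strategy described above.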
Most dplyr workflows can be organized around three kinds of operations: selecting rows, selecting columns, and operating within groups.
Code
demo_q <- nhanes('DEMO_D')
bmx_q <- nhanes('BMX_D')
Rows: keep or reorder observations
Code
demo_q %>%
  filter(RIDAGEYR > 25, RIAGENDR == "Male") %>%  # filter rows by logical and: over 25 and male
  arrange(desc(RIDAGEYR)) %>%  # order rows by descending age
  distinct(DMDHREDU, .keep_all=TRUE) %>%  # keep first row per education level for inspection
  select(SEQN, RIDAGEYR, RIAGENDR, DMDHREDU) %>%  # subset down to a few variables of interest
  # Watch out for attempts to refer to variables you have 'selected out' earlier in the pipeline.
  # Here, DMDHRBRN isn't in the select statement above, so we can't use it in a downstream mutate call
  # mutate(
  #   in_us = if_else(DMDHRBRN == "Born in 50 US States or Washington, DC", TRUE, FALSE)
  # ) %>%
  kable_table()
 SEQN  RIDAGEYR  RIAGENDR  DMDHREDU
31447        85      Male  High School Grad/GED or equivalent
31558        85      Male  9-11th Grade (Includes 12th grade with no diploma)
31610        85      Male  Some College or AA degree
32444        85      Male  Less Than 9th Grade
32455        85      Male  College Graduate or above
35818        85      Male  NA
39906        78      Male  Refused
31634        67      Male  Don't know
Row filters can also use if_any() and if_all() across multiple columns:
Here, using if_any, we find observations where weight (BMXWT in kg), arm length (BMXARML in cm), or thigh circumference (BMXTHICR in cm) is greater than 60 (setting aside the different units of measurement).
Code
bmx_q %>%
  filter(if_any(c(BMXWT, BMXARML, BMXTHICR), ~ .x > 60)) %>%  # any column above 60
  slice_sample(n = 5) %>%  # get a random sample of 5 rows
  kable_table()
 SEQN  BMDSTATS                             BMXWT  BMIWT     BMXRECUM  BMIRECUM  BMXHEAD  BMIHEAD  BMXHT  BMIHT  BMXBMI
33990  Partial: Height and weight obtained     79  Clothing        NA        NA       NA       NA    165     NA      29
35208  Complete data for age group            122  NA              NA        NA       NA       NA    191     NA      34
34966  Complete data for age group             72  NA              NA        NA       NA       NA    165     NA      27
35434  Complete data for age group             98  NA              NA        NA       NA       NA    177     NA      31
31945  Complete data for age group             89  NA              NA        NA       NA       NA    175     NA      29

 SEQN  BMXLEG  BMILEG            BMXCALF  BMICALF           BMXARML  BMIARML           BMXARMC  BMIARMC
33990      NA  Could not obtain       NA  Could not obtain       NA  Could not obtain       NA  Could not obtain
35208      37  NA                     46  NA                     43  NA                     36  NA
34966      36  NA                     38  NA                     36  NA                     34  NA
35434      44  NA                     39  NA                     41  NA                     30  NA
31945      36  NA                     39  NA                     39  NA                     37  NA

 SEQN  BMXWAIST  BMIWAIST          BMXTHICR  BMITHICR          BMXTRI  BMITRI            BMXSUB  BMISUB
33990        NA  Could not obtain        NA  Could not obtain      NA  Could not obtain      NA  Could not obtain
35208       132  NA                      54  NA                  15.0  NA                    31  NA
34966        92  NA                      51  NA                   8.4  NA                    21  NA
35434       118  NA                      55  NA                  21.0  NA                    23  NA
31945       117  NA                      52  NA                  19.2  NA                    23  NA
By comparison, here are rows where all three of these are greater than 30, using if_all:
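A sketch of that if_all call, shown on a small stand-in since the rendered output was not preserved here:

```r
library(dplyr)

# hypothetical stand-in for the three bmx_q columns
bmx_demo <- tibble(
  BMXWT    = c(80, 55, 90),
  BMXARML  = c(38, 33, 25),
  BMXTHICR = c(52, 48, 50)
)

bmx_demo %>%
  filter(if_all(c(BMXWT, BMXARML, BMXTHICR), ~ .x > 30))  # every column above 30
# with the real data: bmx_q %>% filter(if_all(c(BMXWT, BMXARML, BMXTHICR), ~ .x > 30))
```

Whereas if_any keeps a row when any one condition holds, if_all requires all of them to hold.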