The stringr package provides a consistent set of functions for working with strings. Its main functions start with str_ and are vectorized over their string argument, so they work naturally on columns of a data.frame.
We will use a small example dataset to demonstrate the core verbs.
Code
people <- tibble::tibble(
  id = 1:5,
  name = c("Ada Lovelace", "Grace Hopper", "Margaret Hamilton",
           "Katherine Johnson", "Mary Jackson"),
  email = c("ada@navy.mil", "grace@navy.mil", "margaret@mit.edu",
            "katherine@nasa.gov", NA),
  dept = c("CompSci", "CompSci", "Engineering", "Research", "Research")
)
people %>% kable_table()

id  name               email               dept
1   Ada Lovelace       ada@navy.mil        CompSci
2   Grace Hopper       grace@navy.mil      CompSci
3   Margaret Hamilton  margaret@mit.edu    Engineering
4   Katherine Johnson  katherine@nasa.gov  Research
5   Mary Jackson       NA                  Research
Creating strings
R strings are wrapped in either single or double quotes. To include a literal quote or backslash inside a string, escape it with a backslash.
Code
"He said: \"strings are useful\""
[1] "He said: \"strings are useful\""
Code
"A backslash looks like this: \\"
[1] "A backslash looks like this: \\"
Raw strings (available since R 4.0.0) are useful when you want to avoid escaping backslashes:
Code
r"(C:\Users\hallquist\Documents\file.txt)"
[1] "C:\\Users\\hallquist\\Documents\\file.txt"
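Printing a string shows its escaped form, which can make it look as though the backslashes were doubled in the data itself. The following is a minimal sketch, assuming R 4.0.0 or later for the raw-string syntax, that compares the two spellings and inspects the actual contents. The path and variable names are illustrative only.

```r
library(stringr)

escaped <- "C:\\Users\\file.txt"   # regular string: each backslash is escaped
raw     <- r"(C:\Users\file.txt)"  # raw string: backslashes are taken literally

identical(escaped, raw)  # TRUE: both spellings create the same string
writeLines(raw)          # writes the contents, with single backslashes
str_length(raw)          # counts characters in the contents, not in the printed form
```

writeLines() is the quickest way to check what a string really contains when the printed form is ambiguous.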
Combine strings
Code
people %>%
  mutate(
    label = str_c(name, " (", dept, ")", sep = "")
  ) %>%
  kable_table()

id  name               email               dept         label
1   Ada Lovelace       ada@navy.mil        CompSci      Ada Lovelace (CompSci)
2   Grace Hopper       grace@navy.mil      CompSci      Grace Hopper (CompSci)
3   Margaret Hamilton  margaret@mit.edu    Engineering  Margaret Hamilton (Engineering)
4   Katherine Johnson  katherine@nasa.gov  Research     Katherine Johnson (Research)
5   Mary Jackson       NA                  Research     Mary Jackson (Research)
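The label above has no missing pieces, but str_c() propagates NA: if any piece is missing, the combined string is missing too. A small sketch of that behavior, using two of the rows from the example data and a placeholder text ("no email") that is purely illustrative:

```r
library(stringr)

name  <- c("Katherine Johnson", "Mary Jackson")
email <- c("katherine@nasa.gov", NA)

# A missing piece makes the whole combined string missing
contact <- str_c(name, " <", email, ">")
contact
# [1] "Katherine Johnson <katherine@nasa.gov>" NA

# Replace NA with a placeholder first if a label is always needed
contact_filled <- str_c(name, " <", str_replace_na(email, "no email"), ">")
contact_filled
# [1] "Katherine Johnson <katherine@nasa.gov>" "Mary Jackson <no email>"
```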
str_glue() is convenient for inline formatting:
Code
people %>%
  mutate(label = str_glue("{name} [{dept}]")) %>%
  kable_table()

id  name               email               dept         label
1   Ada Lovelace       ada@navy.mil        CompSci      Ada Lovelace [CompSci]
2   Grace Hopper       grace@navy.mil      CompSci      Grace Hopper [CompSci]
3   Margaret Hamilton  margaret@mit.edu    Engineering  Margaret Hamilton [Engineering]
4   Katherine Johnson  katherine@nasa.gov  Research     Katherine Johnson [Research]
5   Mary Jackson       NA                  Research     Mary Jackson [Research]
If you want to collapse a vector into one string, use str_flatten():
Code
str_flatten(people$dept, collapse = ", ")
[1] "CompSci, CompSci, Engineering, Research, Research"
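The flattened string repeats departments because the column does. A short sketch of one way to list each value once, using a standalone copy of the dept column; it also shows that str_c() with collapse produces the same single string:

```r
library(stringr)

dept <- c("CompSci", "CompSci", "Engineering", "Research", "Research")

# str_c() with collapse gives the same result as str_flatten()
all_depts <- str_c(dept, collapse = ", ")

# Drop duplicates first to list each department once
unique_depts <- str_flatten(unique(dept), collapse = ", ")
unique_depts
# [1] "CompSci, Engineering, Research"
```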
String length and substrings
Code
people %>%
  mutate(
    n_chars = str_length(name),
    first_name = str_sub(name, 1, str_locate(name, " ")[, 1] - 1)
  ) %>%
  kable_table()

id  name               email               dept         n_chars  first_name
1   Ada Lovelace       ada@navy.mil        CompSci      12       Ada
2   Grace Hopper       grace@navy.mil      CompSci      12       Grace
3   Margaret Hamilton  margaret@mit.edu    Engineering  17       Margaret
4   Katherine Johnson  katherine@nasa.gov  Research     17       Katherine
5   Mary Jackson       NA                  Research     12       Mary
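str_sub() also accepts negative positions, and the first-name extraction can be written without position arithmetic. A brief sketch of both ideas on two names from the example data; the regex alternative is one option among several, not the only way to do it:

```r
library(stringr)

name <- c("Ada Lovelace", "Grace Hopper")

# Negative positions count from the end of the string
last_three <- str_sub(name, -3, -1)
last_three
# [1] "ace" "per"

# A regex avoids str_locate(): take everything before the first space
first_name <- str_extract(name, "^[^ ]+")
first_name
# [1] "Ada"   "Grace"
```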
Other common helpers include str_trim() and str_squish() to handle extra whitespace:
Code
str_trim("  too much space  ")
Code
str_squish("too    much    space")
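The difference between the two helpers is easiest to see on a single input that has both outer and inner extra whitespace. A small sketch with an illustrative string:

```r
library(stringr)

messy <- "  too   much   space  "

trimmed  <- str_trim(messy)                 # removes leading and trailing whitespace only
left     <- str_trim(messy, side = "left")  # side can be "both", "left", or "right"
squished <- str_squish(messy)               # also collapses inner runs to one space

trimmed
# [1] "too   much   space"
squished
# [1] "too much space"
```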
Detecting patterns
str_detect() returns TRUE/FALSE for each element:
Code
people %>%
  mutate(is_nasa = str_detect(email, "nasa")) %>%
  kable_table()

id  name               email               dept         is_nasa
1   Ada Lovelace       ada@navy.mil        CompSci      FALSE
2   Grace Hopper       grace@navy.mil      CompSci      FALSE
3   Margaret Hamilton  margaret@mit.edu    Engineering  FALSE
4   Katherine Johnson  katherine@nasa.gov  Research     TRUE
5   Mary Jackson       NA                  Research     NA
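Because the pattern is a regular expression, a dot matches any character, which can produce surprising matches when testing for a domain suffix. The sketch below contrasts an unescaped dot with two stricter alternatives; the third address is a made-up value added only to show the false positive.

```r
library(stringr)

email <- c("ada@navy.mil", "katherine@nasa.gov", "x@example.milk", NA)

loose  <- str_detect(email, ".mil")           # "." matches any character, anywhere
strict <- str_detect(email, "\\.mil$")        # literal dot, anchored at the end
ends   <- str_ends(email, fixed(".mil"))      # fixed() compares the text literally

loose
# [1]  TRUE FALSE  TRUE    NA
strict
# [1]  TRUE FALSE FALSE    NA
```

In all three cases a missing input gives a missing result rather than FALSE.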
To find rows with a missing email:
Code
people %>%
  filter(is.na(email))

id  name          email  dept
5   Mary Jackson  NA     Research
You can count matches with str_count():
Code
people %>%
  mutate(n_vowels = str_count(name, "[aeiouAEIOU]")) %>%
  kable_table()

id  name               email               dept         n_vowels
1   Ada Lovelace       ada@navy.mil        CompSci      6
2   Grace Hopper       grace@navy.mil      CompSci      4
3   Margaret Hamilton  margaret@mit.edu    Engineering  6
4   Katherine Johnson  katherine@nasa.gov  Research     6
5   Mary Jackson       NA                  Research     3
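Listing both cases in the character class works, but the same count can be obtained by making the pattern case-insensitive. A minimal sketch using regex() with ignore_case = TRUE on two of the names:

```r
library(stringr)

name <- c("Ada Lovelace", "Mary Jackson")

# Case-sensitive: the capital "A" in "Ada" is not counted
lower_only <- str_count(name, "[aeiou]")
lower_only
# [1] 5 3

# Case-insensitive: equivalent to "[aeiouAEIOU]"
any_case <- str_count(name, regex("[aeiou]", ignore_case = TRUE))
any_case
# [1] 6 3
```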
Extracting and splitting
str_extract() pulls the first match from each string:
Code
people %>%
  mutate(domain = str_extract(email, "[^@]+$")) %>%
  kable_table()

id  name               email               dept         domain
1   Ada Lovelace       ada@navy.mil        CompSci      navy.mil
2   Grace Hopper       grace@navy.mil      CompSci      navy.mil
3   Margaret Hamilton  margaret@mit.edu    Engineering  mit.edu
4   Katherine Johnson  katherine@nasa.gov  Research     nasa.gov
5   Mary Jackson       NA                  Research     NA
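When a string can contain several matches, str_extract_all() returns all of them instead of only the first. A short sketch on illustrative inputs, one with three matches and one with none:

```r
library(stringr)

x <- c("A-1, B-2, C-3", "no digits")

# First match only; NA when nothing matches
first_digit <- str_extract(x, "\\d")
first_digit
# [1] "1" NA

# All matches: a list with one character vector per input string
all_digits <- str_extract_all(x, "\\d")
all_digits[[1]]
# [1] "1" "2" "3"
```

The second element of all_digits is an empty character vector, so the list always has the same length as the input.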
Use str_split() to break strings into pieces:
Code
str_split("Ada Lovelace", " ")
[[1]]
[1] "Ada" "Lovelace"
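str_split() returns a list because each input can produce a different number of pieces. The sketch below shows two common adjustments: simplify = TRUE for a matrix result and n to cap the number of pieces. The three-part name is a made-up input used only to show the effect of n.

```r
library(stringr)

name <- c("Ada Lovelace", "Mary Jackson")

pieces <- str_split(name, " ")                   # list of character vectors
mat    <- str_split(name, " ", simplify = TRUE)  # character matrix, one row per input

# n limits the number of pieces; the remainder stays in the last piece
capped <- str_split("Mary W. Jackson", " ", n = 2)
capped[[1]]
# [1] "Mary"       "W. Jackson"
```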
If you want multiple columns, tidyr::separate() is handy:
Code
people %>%
  separate(name, into = c("first", "last"), sep = " ") %>%
  kable_table()

id  first      last      email               dept
1   Ada        Lovelace  ada@navy.mil        CompSci
2   Grace      Hopper    grace@navy.mil      CompSci
3   Margaret   Hamilton  margaret@mit.edu    Engineering
4   Katherine  Johnson   katherine@nasa.gov  Research
5   Mary       Jackson   NA                  Research
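Newer tidyr releases document separate() as superseded by a family of separate_wider_*() functions. The following is a sketch of the equivalent call, assuming tidyr 1.3.0 or later is installed; the small people_names tibble is a stand-in for the example data.

```r
library(tidyr)

people_names <- tibble::tibble(name = c("Ada Lovelace", "Grace Hopper"))

# Split on a literal delimiter into named columns
split_names <- separate_wider_delim(
  people_names,
  name,
  delim = " ",
  names = c("first", "last")
)
split_names$first
# [1] "Ada"   "Grace"
```

Unlike separate(), this function treats the delimiter as literal text, and by default it raises an error when a row does not split into exactly the requested number of pieces.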
Replacing
Code
people %>%
  mutate(
    email_safe = str_replace(email, "@", " at "),
    email_domain = str_replace(email, ".*@", "")
  ) %>%
  kable_table()

id  name               email               dept         email_safe             email_domain
1   Ada Lovelace       ada@navy.mil        CompSci      ada at navy.mil        navy.mil
2   Grace Hopper       grace@navy.mil      CompSci      grace at navy.mil      navy.mil
3   Margaret Hamilton  margaret@mit.edu    Engineering  margaret at mit.edu    mit.edu
4   Katherine Johnson  katherine@nasa.gov  Research     katherine at nasa.gov  nasa.gov
5   Mary Jackson       NA                  Research     NA                     NA
To replace all matches, use str_replace_all():
Code
str_replace_all("A-1, B-2, C-3", "-", ":")
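To make the contrast with str_replace() concrete, the sketch below runs both on the same input, then shows two further features of the replacement functions: a named vector of pattern = replacement pairs, and backreferences to captured groups. The inputs are illustrative.

```r
library(stringr)

x <- "A-1, B-2, C-3"

first_only <- str_replace(x, "-", ":")      # "A:1, B-2, C-3"
every_one  <- str_replace_all(x, "-", ":")  # "A:1, B:2, C:3"

# A named vector applies several pattern = replacement pairs in one call
multi <- str_replace_all(x, c("-" = ":", ", " = "; "))
multi
# [1] "A:1; B:2; C:3"

# \\1 and \\2 refer to the first and second captured groups
swapped <- str_replace("Ada Lovelace", "(\\w+) (\\w+)", "\\2, \\1")
swapped
# [1] "Lovelace, Ada"
```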
A compact case study: parsing coded strings
Imagine IDs that pack information into a single string:
Code
ids <- tibble::tibble(
  code = c("S01_age=21", "S02_age=19", "S03_age=22", "S04_age=20")
)
ids %>% kable_table()

code
S01_age=21
S02_age=19
S03_age=22
S04_age=20
We can extract the subject ID and age using str_match():
Code
ids %>%
  mutate(
    subject = str_match(code, "^(S\\d{2})")[, 2],
    age = as.numeric(str_match(code, "age=(\\d{2})$")[, 2])
  ) %>%
  kable_table()

code        subject  age
S01_age=21  S01      21
S02_age=19  S02      19
S03_age=22  S03      22
S04_age=20  S04      20
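str_match() returns a matrix with the full match in the first column and one column per capture group, so both fields can also be pulled out with a single pattern. A sketch of that variant, with a deliberately malformed third code (not part of the original data) to show what happens when the pattern does not match:

```r
library(stringr)

code <- c("S01_age=21", "S02_age=19", "bad_code")

# Column 1: full match, column 2: subject, column 3: age
m <- str_match(code, "^(S\\d{2})_age=(\\d{2})$")

subject <- m[, 2]
age     <- as.numeric(m[, 3])

subject
# [1] "S01" "S02" NA
age
# [1] 21 19 NA
```

A single anchored pattern also acts as a validity check: any code that does not follow the whole format yields NA in every column.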
Summary
The stringr toolkit is built around a small set of verbs:
Create/Combine: str_c(), str_glue(), str_flatten()
Inspect: str_length(), str_detect(), str_count()
Extract/Split: str_extract(), str_match(), str_split()
Modify: str_sub(), str_replace(), str_replace_all(), str_trim()
For more detail and worked examples, see R4DS Ch. 14: https://r4ds.hadley.nz/strings.html
Appendix: regex primer (optional)
Regular expressions (regex) let you describe patterns in text. By default, stringr treats the pattern argument of its str_ functions as a regular expression, interpreted by the ICU engine through the stringi package.
Core building blocks:
. any character
^ start of string, $ end of string
* zero or more, + one or more, ? zero or one
[] character class (e.g., [A-Z], [0-9])
() grouping and capture
| OR
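The building blocks above can be tried out on a few short strings. This is a minimal sketch with made-up inputs, chosen so that each pattern matches some elements and not others:

```r
library(stringr)

x <- c("cat", "cart", "ct", "scat")

# ^ and $ anchor the whole string; ? makes the "a" optional
anchored <- str_detect(x, "^ca?t$")
anchored
# [1]  TRUE FALSE  TRUE FALSE

# . stands for exactly one character of any kind
any_char <- str_detect(x, "c.t")
any_char
# [1]  TRUE FALSE FALSE  TRUE

# () groups the alternatives separated by |
either <- str_detect(c("gray", "grey", "groy"), "gr(a|e)y")
either
# [1]  TRUE  TRUE FALSE

# [] with + matches a run of one or more characters from the class
digits <- str_extract("room 312b", "[0-9]+")
digits
# [1] "312"
```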
Common examples:
Code
str_detect("room 312", "\\d+") # any digits
Code
str_detect("A12", "^[A-Z]\\d{2}$") # one letter, two digits
Code
str_extract("x=42", "\\d+") # extract digits
Code
str_replace("abc-123", "^[a-z]+-", "") # drop leading letters and dash
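Regex escapes such as \d need two backslashes in an R string literal: one is consumed by R when it parses the string, and the other reaches the regex engine. A small sketch that makes the two layers visible; the variable names are illustrative.

```r
library(stringr)

pattern <- "\\d+"
writeLines(pattern)  # prints \d+, which is what the regex engine receives

one_layer <- str_detect("room 312", pattern)
one_layer
# [1] TRUE

# Four backslashes reach the engine as \\d+, a literal backslash followed by "d"s
two_layers <- str_detect("room 312", "\\\\d+")
two_layers
# [1] FALSE

# The same rule applies to escaping a literal dot
dot <- str_detect(c("a.b", "axb"), "a\\.b")
dot
# [1]  TRUE FALSE
```

Raw strings offer an alternative spelling in R 4.0.0 or later: r"(\d+)" is the same pattern as "\\d+".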
Useful helpers:
str_detect(x, pattern) returns TRUE/FALSE
str_extract(x, pattern) returns the first match
str_replace(x, pattern, replacement) replaces the first match
str_replace_all(x, pattern, replacement) replaces all matches
For a deeper treatment and practice, see R4DS Ch. 15: https://r4ds.hadley.nz/regexps.html