Week 3: Data Wrangling

Author

PSYC 859 Data Management and Visualization

Published

January 29, 2026

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data("economics_long")

This assignment is optional and intended for practice. If you choose to submit it, you can email it to the instructor.

Instructions

Follow along with the tutorial and, for each question, write the code you would use (or copy it from the RStudio syntax window). Also paste the relevant output generated by your code. If there is a lot of output, paste just enough to show that you reached the correct answer.

Dataset for assignment

Use the economics_long dataframe from the ggplot2 package for this assignment. You can load it into your working environment with data("economics_long") after you load the tidyverse package.

Tidy the data

Format the data so that it is tidy. Hint: You will need to drop the value01 column.
Create a new dataframe that summarizes the data by year. Include at least the mean and standard deviation (sd) of each variable.
Filter your summary dataframe so that it includes only the years 1975 through 2015. You may have to make year numeric using mutate(). Hint: see ?dplyr::between.
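As a rough illustration of the summarize-then-filter pattern, here is a minimal sketch on a made-up tibble (not the assignment data). The object names toy_long, toy_summary, and toy_filtered, and the choice to keep the data long and group by both year and variable, are assumptions for the example; a wide layout would follow the same idea.

```r
library(dplyr)

# Made-up long-format data for illustration only (not the assignment data)
toy_long <- tibble(
  date     = as.Date(c("1974-01-01", "1974-02-01", "1980-01-01", "1980-02-01")),
  variable = "unemploy",
  value    = c(10, 14, 20, 26)
)

toy_summary <- toy_long |>
  # format() returns the year as character, so convert it to numeric
  mutate(year = as.numeric(format(date, "%Y"))) |>
  group_by(year, variable) |>
  summarize(
    mean_value = mean(value),
    sd_value   = sd(value),
    .groups    = "drop"
  )

# between() keeps both endpoints (1975 and 2015 would be included)
toy_filtered <- toy_summary |>
  filter(between(year, 1975, 2015))

print(toy_filtered)
```

In this toy case only the 1980 row survives the filter, because 1974 falls outside the range.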

Joins

Use the provided housing dataset at files/national-month.csv.gz and tidy it if necessary. Summarize the data by year.
Perform a high-fidelity join between the housing dataset and the tidy economics dataset by year. How do you know the join was successful? Include a brief check (e.g., row counts, anti_join(), or count() of unmatched years).
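One way such a check might look is sketched below on two made-up yearly tables. The table names, column names, and values are hypothetical, and inner_join() is used only as an example; the assignment does not say which join type to use.

```r
library(dplyr)

# Hypothetical yearly summaries; names and values are made up
econ_year    <- tibble(year = c(1990, 1991, 1992), unemploy_mean = c(7.0, 8.6, 9.6))
housing_year <- tibble(year = c(1991, 1992, 1993), price_mean = c(100, 104, 107))

joined <- inner_join(econ_year, housing_year, by = "year")

# Years present in one table but missing from the other
only_econ    <- anti_join(econ_year, housing_year, by = "year")
only_housing <- anti_join(housing_year, econ_year, by = "year")

# Duplicate keys would multiply rows in the join, so check for them too
dup_years <- joined |>
  count(year) |>
  filter(n > 1)

print(nrow(joined))       # rows that matched
print(only_econ$year)     # unmatched years on the economics side
print(only_housing$year)  # unmatched years on the housing side
print(nrow(dup_years))    # 0 means one row per year after the join
```

Comparing the row counts of the inputs, the joined result, and the two anti_join() results shows exactly which years were dropped and why.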

Core dplyr practice

Using the tidy economics dataset:

Use select() to keep only date, variable, and value, then rename() value to series_value.
Use arrange() to sort the data by variable, then by descending series_value.
Use slice_max() (or slice_min()) to return the top 5 (or bottom 5) rows per variable.
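The three steps above can be sketched on a small made-up tibble. The data and object names here are hypothetical, and n = 2 is used instead of 5 only because the toy data has three rows per variable.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  date     = as.Date("2000-01-01") + 0:5,
  variable = rep(c("a", "b"), each = 3),
  value    = c(3, 1, 2, 30, 10, 20),
  value01  = c(1, 0, 0.5, 1, 0, 0.5)
)

toy_sorted <- toy |>
  select(date, variable, value) |>        # keep three columns
  rename(series_value = value) |>         # new_name = old_name
  arrange(variable, desc(series_value))   # variable ascending, value descending

# Top 2 rows per variable; slice_max() keeps ties by default (with_ties = TRUE)
top2 <- toy_sorted |>
  group_by(variable) |>
  slice_max(series_value, n = 2) |>
  ungroup()

print(top2)
```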

Use case_when() to create a categorical column that bins series_value into at least three levels (e.g., “low”, “medium”, “high”). Then use count() to report how many observations fall in each category for each variable.
Use across() inside summarize() to compute the mean and standard deviation of series_value for each variable, after filtering to the years 1990 through 2015.
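A minimal sketch of both patterns on made-up data follows. The cut points (10 and 20), the column name level, and the object names are arbitrary assumptions for the example; sensible thresholds for the real series will differ by variable.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  year         = c(1985, 1995, 2005, 1995, 2005, 2015),
  variable     = rep(c("a", "b"), each = 3),
  series_value = c(5, 15, 25, 8, 12, 46)
)

binned <- toy |>
  mutate(level = case_when(
    is.na(series_value) ~ NA_character_,  # otherwise NA would fall through to "high"
    series_value < 10   ~ "low",
    series_value < 20   ~ "medium",
    TRUE                ~ "high"
  ))

# One row per variable-by-level combination, with the count in column n
level_counts <- binned |>
  count(variable, level)

# across() applies each function in the list to the selected column(s);
# the results are named series_value_mean and series_value_sd
stats_1990_2015 <- toy |>
  filter(between(year, 1990, 2015)) |>
  group_by(variable) |>
  summarize(across(series_value, list(mean = mean, sd = sd)), .groups = "drop")

print(level_counts)
print(stats_1990_2015)
```

Conditions in case_when() are evaluated in order, so each value gets the first label whose condition is TRUE.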

Data exploration and processing

What would constitute an invalid or outlier observation? State criteria that you can justify, then search the dataframe for any values that meet them, either by writing a custom function or by chaining together several tidyverse functions.
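As one possible starting point, the sketch below combines a small custom function with a tidyverse chain on made-up data. The rules it uses (negative or missing values are invalid; values more than 1.5 times the interquartile range beyond the quartiles are outliers) and the helper name flag_outlier are assumptions for illustration, not criteria set by the assignment. Some real series may legitimately take negative values.

```r
library(dplyr)

# Hypothetical helper: TRUE where x lies more than k * IQR beyond the quartiles
flag_outlier <- function(x, k = 1.5) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Made-up data for illustration only
toy <- tibble(
  variable     = rep(c("a", "b"), each = 5),
  series_value = c(1, 2, 3, 4, 100, 10, 11, 12, 13, -5)
)

flagged <- toy |>
  group_by(variable) |>   # judge each series against its own distribution
  mutate(
    invalid = is.na(series_value) | series_value < 0,  # assumed validity rule
    outlier = flag_outlier(series_value)
  ) |>
  ungroup() |>
  filter(invalid | outlier)

print(flagged)
```

Grouping by variable matters here because the series are on very different scales, so a single cutoff across all of them would not be meaningful.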