Week 1: Data Tidying

Author: PSYC 859 Data Management and Visualization

Published: January 8, 2026

This assignment is due by 1/21/2026 at 8am.

Instructions

For each question, either write the code you would use or copy and paste it from the RStudio syntax window. Additionally, paste any (reasonable) output generated by the code. If there is a lot of output, paste enough to show that you got the correct answer.

Datasets for Vignettes

Load the who2 dataset from the tidyr package into your working environment.

data("who2", package = "tidyr")

This dataset records counts of tuberculosis cases by country and year. The remaining column names encode a method of diagnosis, sex, and age group, separated by underscores. The method of diagnosis codes are: rel = relapse, sn = negative pulmonary smear, sp = positive pulmonary smear, ep = extrapulmonary. For example, sp_m_014 corresponds to positive pulmonary smear (sp), male sex (m), and ages 0-14 (014).

You will see many NA values in who2 because some country-year combinations do not report counts for certain diagnosis/sex/age groups. When you summarize totals, remember to use na.rm = TRUE so missing values do not turn your results into NA.
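As a minimal illustration of why na.rm matters, the sketch below sums a small made-up vector (not taken from who2) with and without the argument.

```r
# Hypothetical counts with one missing report
counts <- c(12, NA, 7)

total_default <- sum(counts)                # NA: a single missing value propagates
total_na_rm   <- sum(counts, na.rm = TRUE)  # 19: missing values are dropped first

total_default
total_na_rm
```

The same argument is available in mean(), max(), and most other base summary functions.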

Additionally, load the Pew relig_income dataset from the tidyr package:

data("relig_income", package = "tidyr")

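This dataset describes the relationship between income and religion: each row is a religion, and each remaining column holds the number of survey respondents in one income bracket.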

Basic Pew Dataset Structure

Using the Pew dataset

Look at the first and last five observations of the Pew dataset using head() and tail() (both show six rows by default, so set n = 5). Is the dataset considered tidy? Why or why not?

Look at the structure (str()) and class (class()) of the data. If this is not tidy, verbally describe how it would look if it were a tidy dataset. If it is tidy, is it in a format you would use to store the data long-term?
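The inspection functions named above can be tried on any data frame. The sketch below uses a small made-up tibble called toy (a hypothetical name, not one of the assignment datasets) so you can see what each call returns.

```r
library(tibble)

# Hypothetical table used only to demonstrate the inspection functions
toy <- tibble(
  group = c("a", "b", "c"),
  low   = c(1, 2, 3),
  high  = c(4, 5, 6)
)

first_rows <- head(toy, n = 2)  # first two rows
last_rows  <- tail(toy, n = 2)  # last two rows

first_rows
last_rows
str(toy)    # column names, types, and a preview of the values
class(toy)  # a tibble reports several classes, ending in "data.frame"
```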

Tidying Pew Data

Use tidyr verbs to tidy the Pew dataset. Your tidy dataset should have columns named religion, income, and count. Show the first five rows.
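One way to reshape a wide table is tidyr::pivot_longer(). The sketch below applies it to a made-up sales table; the table and the names store, quarter, and revenue are hypothetical, so you will need to adapt the arguments to the Pew data yourself.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per store, one column per quarter
sales_wide <- tibble(
  store = c("north", "south"),
  q1    = c(10, 20),
  q2    = c(15, 25)
)

sales_long <- pivot_longer(
  sales_wide,
  cols      = !store,      # every column except the identifier
  names_to  = "quarter",   # old column names become values of this column
  values_to = "revenue"    # old cell values are stored in this column
)

head(sales_long, n = 5)
```

Each store-quarter pair now occupies its own row, which is the long layout that the tidy-data rules describe.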

Tidying Tuberculosis Data

Using the who2 dataset

Explore the who2 dataset using the same methods as in the section above (head(), tail(), str(), and class()).

Using functions from the tidyr package, tidy the who2 dataset so there is a case_group variable and a cases variable. Use those names throughout the rest of the assignment. Show the first five rows of your new dataframe by using head(). Remember, a tidy dataset satisfies three conditions:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

The values within the case_group variable are still not tidy because they pack three variables into one (case type, sex, and age). Use tidyr verbs to separate this variable into type, sex, and age (hint: the values are separated by _).
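Splitting one delimited column into several can be sketched with tidyr::separate(). The example uses a made-up code column and hypothetical output names (first, second, third). Recent tidyr releases mark separate() as superseded and offer separate_wider_delim() as the newer alternative, so either may be appropriate depending on your installed version.

```r
library(tidyr)
library(tibble)

# Hypothetical table with three pieces of information packed into `code`
toy <- tibble(
  id   = 1:2,
  code = c("a_x_1", "b_y_2")
)

toy_split <- separate(
  toy,
  col  = code,
  into = c("first", "second", "third"),  # one new column per piece
  sep  = "_"                              # split wherever an underscore appears
)

toy_split
```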

Check that each row is a unique observation by counting duplicates of country, year, type, sex, and age. Report whether any combinations appear more than once.
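A duplicate check of this kind can be sketched with dplyr::count() followed by a filter. The table and the column names site and year below are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical table in which the (site, year) pair "a", 2000 appears twice
toy <- tibble(
  site  = c("a", "a", "b"),
  year  = c(2000, 2000, 2001),
  value = c(1, 2, 3)
)

dupes <- toy |>
  count(site, year) |>  # one row per combination, with its frequency in `n`
  filter(n > 1)         # keep only combinations that occur more than once

dupes
```

An empty result means every combination of the counted columns is unique.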

Rename the values within sex and age to be more descriptive of what they represent (male or female; 0-14, 15-24, etc.). You can use a combination of dplyr::mutate() and dplyr::recode() to recode the values. Use ?recode if you get stuck.
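The mutate() plus recode() pattern looks roughly like the sketch below, which relabels a made-up table (the columns grp and band and all of their values are hypothetical). Note that codes starting with a digit must be wrapped in backticks. recode() is marked as superseded in recent dplyr releases, so ?recode may point you to a newer replacement.

```r
library(dplyr)
library(tibble)

# Hypothetical coded values
toy <- tibble(
  grp  = c("t", "c", "t"),
  band = c("0110", "1120", "0110")
)

toy_labeled <- toy |>
  mutate(
    grp  = recode(grp, t = "treatment", c = "control"),
    # Names that are not valid R identifiers need backticks
    band = recode(band, `0110` = "1-10", `1120` = "11-20")
  )

toy_labeled
```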

Take your new tidy dataframe with the recoded values and variables and demonstrate tidyr::unite() by recombining sex and age into a single variable (e.g., sex_age). This is just to practice unite() before you mess it up again in the next step. Show the first five rows of your new dataframe.
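For reference, a minimal unite() call on a made-up table is sketched below; the columns colour and size and the new name colour_size are hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical table with two columns to be pasted together
toy <- tibble(
  id     = 1:2,
  colour = c("red", "blue"),
  size   = c("small", "large")
)

toy_united <- unite(
  toy,
  col = "colour_size",  # name of the combined column
  colour, size,         # columns to combine, in order
  sep = "_"             # separator placed between the values
)

head(toy_united, n = 5)
```

By default unite() drops the input columns; pass remove = FALSE to keep them.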

Go back to the dataframe you created in question two (the one with the case_group and cases variables). Using that tidy dataframe, use tidyr verbs to recreate a ‘messy’ dataframe. It should look exactly the same as (or similar to) the who2 dataset when you first loaded it into your environment. Show the first five rows of your new dataframe. If the code complains about duplicate identifiers (an error from spread(), or a warning from pivot_wider() that values are not uniquely identified), use dplyr::distinct() in your pipeline.
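Going from long back to wide can be sketched with tidyr::pivot_wider(). The made-up table below (hypothetical columns store, quarter, and revenue) contains one exact duplicate row, which distinct() removes before widening.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long table; the last row exactly repeats the one before it
sales_long <- tibble(
  store   = c("north", "north", "south", "south", "south"),
  quarter = c("q1", "q2", "q1", "q2", "q2"),
  revenue = c(10, 15, 20, 25, 25)
)

sales_wide <- sales_long |>
  distinct() |>  # drop fully identical rows so each cell has a single value
  pivot_wider(names_from = quarter, values_from = revenue)

head(sales_wide, n = 5)
```

distinct() only helps when the repeated rows are identical in every column. If duplicates carry different values, you would have to decide how to combine them first.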

Arrange the dataset in descending order of tuberculosis cases, from most to least, using dplyr verbs (use your cases variable).
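Descending sorts in dplyr combine arrange() with desc(). The sketch below uses a made-up table with hypothetical columns item and n, and shows that missing values are placed last.

```r
library(dplyr)
library(tibble)

# Hypothetical counts, including one missing value
toy <- tibble(
  item = c("a", "b", "c"),
  n    = c(3, NA, 9)
)

toy_sorted <- arrange(toy, desc(n))  # largest first; NA rows go to the end

toy_sorted
```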

Summarize how many cases of tuberculosis there are by sex and age (use your cases variable). Show your output. If you get NA values for everything, remember to remove those empty values somewhere in your pipeline! Hint: look at the default arguments for the functions you use to summarize your data.
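A grouped total with missing values removed can be sketched as follows. The table and the columns grp, kind, and n are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical counts with one missing value in group a/x
toy <- tibble(
  grp  = c("a", "a", "b", "b"),
  kind = c("x", "x", "y", "y"),
  n    = c(1, NA, 5, 2)
)

totals <- toy |>
  group_by(grp, kind) |>
  summarise(
    total   = sum(n, na.rm = TRUE),  # without na.rm = TRUE, group a/x would be NA
    .groups = "drop"                 # return an ungrouped result
  )

totals
```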

Finally, summarize the dataset by country and sex, then add a variable giving the relative prevalence by sex within each country (each sex's share of that country's total cases).
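One way to compute a within-group share is to summarise by two variables, keep the outer grouping, and divide each total by the group sum. The sketch below assumes that "relative prevalence" means a proportion of the group total, and uses a made-up table with hypothetical columns region, kind, and n.

```r
library(dplyr)
library(tibble)

# Hypothetical counts by region and kind
toy <- tibble(
  region = c("east", "east", "east", "west", "west"),
  kind   = c("x", "x", "y", "x", "y"),
  n      = c(20, 10, 10, 5, 15)
)

shares <- toy |>
  group_by(region, kind) |>
  # "drop_last" leaves the result grouped by region only
  summarise(total = sum(n, na.rm = TRUE), .groups = "drop_last") |>
  # sum(total) is therefore computed within each region
  mutate(share = total / sum(total)) |>
  ungroup()

shares
```

Within each region the share column sums to 1, which is a quick sanity check on the grouping.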