Week 9/10: Graphical Idioms in ggplot2

Author: PSYC 859 Data Management and Visualization

Published: March 12, 2026

This assignment is designed as guided practice for building ggplot objects from layers. The week 9 and week 10 materials emphasize that plots are not single opaque commands. Instead, they are built from data, mappings, geoms, stats, positions, scales, coordinates, and themes.

Instructions

For each question:

  1. Build the plot using ggplot2 layers. You may either write the plot in one expression or create and modify an object in multiple steps.
  2. Include the code you used and the resulting plot.
  3. When a question asks for interpretation, write 1-3 sentences.

Unless a prompt tells you otherwise, this is a good default workflow:

  1. Start with ggplot(...) and map the core variables.
  2. Add the necessary geoms, stats, labels, and theme elements.
  3. Use intermediate objects only if they help you think more clearly.
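
The workflow above can be sketched as follows. This is only an illustration on ggplot2's built-in mpg data, which is not one of the assignment datasets; the object names p_base and p_final are hypothetical.

```r
library(ggplot2)

# Step 1: start with ggplot() and map the core variables.
p_base <- ggplot(mpg, aes(x = displ, y = hwy))

# Step 2: add geoms, labels, and a theme as separate layers.
p_final <- p_base +
  geom_point() +
  labs(
    title = "Highway mileage by engine displacement",
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon"
  ) +
  theme_bw()

# Step 3: the intermediate object p_base is optional; printing p_final draws the plot.
p_final
```

Writing the same plot as a single expression, without p_base, produces an identical result.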

Datasets for this assignment

We will use three datasets that are relevant to social and behavioral science questions:

  • gss_cat from forcats: responses from the General Social Survey.
  • economics from ggplot2: U.S. macroeconomic indicators measured over time.
  • midwest from ggplot2: county-level demographic and socioeconomic data from the U.S. Midwest.

Use the chunks below to inspect the data before you begin.

library(tidyverse)  # attaches dplyr (glimpse), forcats (gss_cat), and ggplot2 (economics, midwest)
glimpse(gss_cat)
Rows: 21,483
Columns: 9
$ year    <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ marital <fct> Never married, Divorced, Widowed, Never married, Divorced, Mar…
$ age     <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51, 52, 40…
$ race    <fct> White, White, White, White, White, White, White, White, White,…
$ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not applicable, …
$ partyid <fct> "Ind,near rep", "Not str republican", "Independent", "Ind,near…
$ relig   <fct> Protestant, Protestant, Protestant, Orthodox-christian, None, …
$ denom   <fct> "Southern baptist", "Baptist-dk which", "No denomination", "No…
$ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, NA, 3, 3…

glimpse(economics)
Rows: 574
Columns: 6
$ date     <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967-11-01, …
$ pce      <dbl> 506.7, 509.8, 515.6, 512.2, 517.4, 525.1, 530.9, 533.6, 544.3…
$ pop      <dbl> 198712, 198911, 199113, 199311, 199498, 199657, 199808, 19992…
$ psavert  <dbl> 12.6, 12.6, 11.9, 12.9, 12.8, 11.8, 11.7, 12.3, 11.7, 12.3, 1…
$ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4…
$ unemploy <dbl> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877, 2709, 2…

glimpse(midwest)
Rows: 437
Columns: 28
$ PID                  <int> 561, 562, 563, 564, 565, 566, 567, 568, 569, 570,…
$ county               <chr> "ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN", "…
$ state                <chr> "IL", "IL", "IL", "IL", "IL", "IL", "IL", "IL", "…
$ area                 <dbl> 0.052, 0.014, 0.022, 0.017, 0.018, 0.050, 0.017, …
$ poptotal             <int> 66090, 10626, 14991, 30806, 5836, 35688, 5322, 16…
$ popdensity           <dbl> 1270.9615, 759.0000, 681.4091, 1812.1176, 324.222…
$ popwhite             <int> 63917, 7054, 14477, 29344, 5264, 35157, 5298, 165…
$ popblack             <int> 1702, 3496, 429, 127, 547, 50, 1, 111, 16, 16559,…
$ popamerindian        <int> 98, 19, 35, 46, 14, 65, 8, 30, 8, 331, 51, 26, 17…
$ popasian             <int> 249, 48, 16, 150, 5, 195, 15, 61, 23, 8033, 89, 3…
$ popother             <int> 124, 9, 34, 1139, 6, 221, 0, 84, 6, 1596, 20, 7, …
$ percwhite            <dbl> 96.71206, 66.38434, 96.57128, 95.25417, 90.19877,…
$ percblack            <dbl> 2.57527614, 32.90043290, 2.86171703, 0.41225735, …
$ percamerindan        <dbl> 0.14828264, 0.17880670, 0.23347342, 0.14932156, 0…
$ percasian            <dbl> 0.37675897, 0.45172219, 0.10673071, 0.48691813, 0…
$ percother            <dbl> 0.18762294, 0.08469791, 0.22680275, 3.69733169, 0…
$ popadults            <int> 43298, 6724, 9669, 19272, 3979, 23444, 3583, 1132…
$ perchsd              <dbl> 75.10740, 59.72635, 69.33499, 75.47219, 68.86152,…
$ percollege           <dbl> 19.63139, 11.24331, 17.03382, 17.27895, 14.47600,…
$ percprof             <dbl> 4.355859, 2.870315, 4.488572, 4.197800, 3.367680,…
$ poppovertyknown      <int> 63628, 10529, 14235, 30337, 4815, 35107, 5241, 16…
$ percpovertyknown     <dbl> 96.27478, 99.08714, 94.95697, 98.47757, 82.50514,…
$ percbelowpoverty     <dbl> 13.151443, 32.244278, 12.068844, 7.209019, 13.520…
$ percchildbelowpovert <dbl> 18.011717, 45.826514, 14.036061, 11.179536, 13.02…
$ percadultpoverty     <dbl> 11.009776, 27.385647, 10.852090, 5.536013, 11.143…
$ percelderlypoverty   <dbl> 12.443812, 25.228976, 12.697410, 6.217047, 19.200…
$ inmetro              <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0…
$ category             <chr> "AAR", "LHR", "AAR", "ALU", "AAR", "AAR", "LAR", …

Part 1: Simple bar plot

The week 10 slides describe bar charts as an idiom for one key variable with a count or other summary. Start with a simple frequency display.

  1. Using gss_cat, create a bar plot showing the number of respondents in each level of marital.
  2. Write the plot in whatever style is most natural to you.
  3. Add a title and axis labels.
  4. Apply a non-default theme such as theme_bw() or theme_classic().
  5. Rebuild the same bar plot so that the bars are ordered from most common to least common.
  6. Briefly explain why reordering may improve interpretation.
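
One way to order bars by frequency is forcats::fct_infreq(), which reorders factor levels from most to least common. The sketch below shows the pattern on ggplot2's mpg data (chosen here only for illustration, not an assignment dataset); p_ordered is a hypothetical name.

```r
library(ggplot2)
library(forcats)

# fct_infreq() reorders the levels by how often each one occurs,
# so geom_bar() draws the bars from most to least common.
p_ordered <- ggplot(mpg, aes(x = fct_infreq(class))) +
  geom_bar() +
  labs(
    title = "Vehicle classes ordered by frequency",
    x = "Vehicle class",
    y = "Number of models"
  ) +
  theme_classic()

p_ordered
```

The same call should work on any factor or character variable, since the reordering happens inside aes() and leaves the data unchanged.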

Part 2: Stacked bar chart

Stacked bars are useful when one categorical variable defines position and a second defines composition within each bar.

  1. Using gss_cat, create a stacked bar chart with x = race and fill = marital.
  2. Use the raw data so that geom_bar() counts cases automatically.
  3. Add readable labels and a legend title.
  4. Create a second version using position = "fill" so that all bars have equal height.
  5. In 1-2 sentences, explain what comparison becomes easier in the normalized version and what comparison becomes harder.
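
The difference between the default stacking and position = "fill" can be sketched on ggplot2's mpg data (used only as an illustration). The names p_bars, p_stack, and p_fill are hypothetical.

```r
library(ggplot2)

# Shared base: x sets the bar position, fill sets the composition within each bar.
p_bars <- ggplot(mpg, aes(x = class, fill = drv))

# Default position = "stack": bar height is the raw count of rows.
p_stack <- p_bars +
  geom_bar() +
  labs(x = "Vehicle class", y = "Count", fill = "Drive train")

# position = "fill": every bar is rescaled to height 1, so segments show proportions.
p_fill <- p_bars +
  geom_bar(position = "fill") +
  labs(x = "Vehicle class", y = "Proportion", fill = "Drive train")

p_stack
p_fill
```

Because both versions reuse p_bars, only the position argument and the y-axis label differ between them.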

Part 3: Dodged bar chart

The week 10 slides note that dodged bars support easier comparison of aligned heights within each group than stacked bars do.

  1. Using the same variables as above (race and marital), create a dodged bar chart.
  2. Make sure the bars are side-by-side rather than stacked.
  3. Add labels and a theme.
  4. Write 1-2 sentences comparing the dodged version to the stacked version. Which comparisons are easier in each plot?
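
A dodged layout changes only the position adjustment. The sketch below again uses ggplot2's mpg data as a stand-in; p_dodge is a hypothetical name, and preserve = "single" is an optional choice that keeps bar widths equal when some category combinations have no cases.

```r
library(ggplot2)

# position_dodge() places the bars for each fill level side by side,
# so every bar starts at zero and heights can be compared directly.
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  labs(x = "Vehicle class", y = "Count", fill = "Drive train") +
  theme_bw()

p_dodge
```

Using position = "dodge" is the shorter form; it widens the remaining bars when a combination is empty.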

Part 4: Histogram

Histograms are idioms for one continuous variable. In the General Social Survey, age is a natural candidate.

  1. Create a histogram of respondent age from gss_cat.
  2. Choose a reasonable number of bins or a reasonable binwidth.
  3. Add a vertical reference line at the median age using geom_vline().
  4. Label the plot clearly.
  5. Create a second histogram of age faceted by race.
  6. Briefly comment on whether the age distributions appear similar or different across groups.
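
The histogram, reference line, and faceting steps can be sketched as below. The example uses hwy from ggplot2's mpg data purely for illustration; median_hwy, p_hist, and p_hist_facet are hypothetical names, and the binwidth of 2 is an arbitrary choice.

```r
library(ggplot2)

# Compute the median first; na.rm = TRUE matters when the variable has missing values.
median_hwy <- median(mpg$hwy, na.rm = TRUE)

p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, color = "white") +
  geom_vline(xintercept = median_hwy, linetype = "dashed") +
  labs(
    title = "Distribution of highway mileage",
    x = "Highway miles per gallon",
    y = "Number of models"
  )

# Adding a facet layer splits the same histogram into one panel per group.
p_hist_facet <- p_hist + facet_wrap(~ drv, ncol = 1)

p_hist
p_hist_facet
```

Note that the dashed line in the faceted version still marks the overall median in every panel, because xintercept is a single constant rather than a per-group value.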

Part 5: Boxplot

Boxplots combine a categorical x variable with a continuous y variable. They are useful for comparing medians, spread, and outliers across groups.

  1. Filter gss_cat to remove missing values on tvhours and marital.
  2. Create a boxplot of tvhours by marital.
  3. Consider using coord_flip() if the labels are crowded.
  4. Add a title and informative axis labels.
  5. Create a second version that overlays jittered points or uses partial transparency to reveal individual observations.
  6. In 1-2 sentences, explain what the points add beyond the boxplot summary.
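
A minimal sketch of the filter, boxplot, and jitter pattern follows, using ggplot2's mpg data as a stand-in. The names mpg_complete and p_box are hypothetical. The mpg data has no missing values, so the filter step is included only to show the pattern.

```r
library(ggplot2)
library(dplyr)

# Keep only rows where both plotted variables are observed.
mpg_complete <- mpg |>
  filter(!is.na(hwy), !is.na(class))

p_box <- ggplot(mpg_complete, aes(x = class, y = hwy)) +
  # Hide the boxplot's own outlier marks so they are not drawn twice.
  geom_boxplot(outlier.shape = NA) +
  # Jitter horizontally only, so the y values stay exact.
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  coord_flip() +
  labs(
    title = "Highway mileage by vehicle class",
    x = "Vehicle class",
    y = "Highway miles per gallon"
  )

p_box
```

Setting height = 0 is a design choice: vertical jitter would move points away from their true values on the measured axis.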

Part 6: Scatterplot

Scatterplots are idioms for two continuous variables. The midwest dataset contains many social-demographic variables that work well here.

  1. Use midwest to create a scatterplot with x = percollege and y = percbelowpoverty.
  2. Add point transparency to deal with overplotting.
  3. Map state to color.
  4. Add a smooth trend line using stat_smooth(se = FALSE).
  5. Add a rug plot to show the marginal distribution of each variable along the axes.
  6. Write 1-2 sentences describing the direction of association between college education and poverty across counties.
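
The scatterplot layers can be combined as in the sketch below, shown on ggplot2's mpg data as an illustration only; p_scatter is a hypothetical name. Here color is mapped inside geom_point(), which is one possible choice: it yields a single overall trend line, whereas mapping color in ggplot() would fit one line per group.

```r
library(ggplot2)

p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # alpha below 1 makes overlapping points visible as darker regions.
  geom_point(aes(color = drv), alpha = 0.5) +
  # One smooth for all points, because color is not mapped at the plot level.
  stat_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "black") +
  # Rug marks along the bottom and left axes show where observations fall.
  geom_rug(alpha = 0.3) +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    color = "Drive train"
  )

p_scatter
```

Specifying method and formula explicitly avoids the message ggplot2 prints when it chooses a smoothing method on its own.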

Part 7: Line plot

Line plots are idioms for ordered x variables, especially time. Use the economics dataset to show change over time.

  1. Create a line plot of unemploy across date.
  2. Add a descriptive title and axis labels.
  3. Format the y-axis so the labels are easy to read.
  4. Create a second line plot showing uempmed across date.
  5. Use coord_cartesian() to zoom in on dates from January 1, 1970 through December 31, 2010.
  6. Add an annotation using annotate() that draws attention to a notable peak or sustained increase.
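
The zoom and annotation steps can be sketched on a different column of economics, psavert (personal saving rate), so the pattern is visible without plotting the assigned variables. The names zoom_range, in_range, peak, and p_line are hypothetical, and the annotation label is an assumption made for the example.

```r
library(ggplot2)

# The date axis needs Date values, not character strings.
zoom_range <- as.Date(c("1970-01-01", "2010-12-31"))

# Locate the highest value inside the zoom range so the annotation is data-driven.
in_range <- economics[economics$date >= zoom_range[1] &
                        economics$date <= zoom_range[2], ]
peak <- in_range[which.max(in_range$psavert), ]

p_line <- ggplot(economics, aes(x = date, y = psavert)) +
  geom_line() +
  # Append a percent sign to each y-axis label.
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  # coord_cartesian() zooms the view without dropping rows from the data.
  coord_cartesian(xlim = zoom_range) +
  annotate("text", x = peak$date, y = peak$psavert,
           label = "Highest rate in range", hjust = -0.1) +
  labs(title = "U.S. personal saving rate", x = "Date", y = "Saving rate")

p_line
```

For large counts, a labeling function such as scales::label_comma() is one option for the labels argument.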

Part 8: Dot plot

Dot plots are often a strong alternative to bars when the goal is to compare summarized values across categories. They rely on aligned position, which is perceptually efficient.

  1. Create a summary dataset from gss_cat that contains the mean tvhours for each level of relig.
  2. Remove missing tvhours values before summarizing.
  3. Arrange the summary from lowest mean to highest mean.
  4. Use that summary dataset to build a dot plot of mean TV hours by religion.
  5. Use geom_point() and optionally geom_segment() to create a lollipop-style display.
  6. Flip the coordinates if that improves readability.
  7. Add clear labels.
  8. In 1-2 sentences, explain why a dot plot may be preferable to a bar chart for this kind of summarized comparison.
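
The summarize-then-plot pattern can be sketched as below on ggplot2's mpg data, used only as an illustration; hwy_summary and p_dot are hypothetical names. One detail worth noting: arrange() sorts the rows, but ggplot2 orders a categorical axis by factor levels, so the sketch also reorders the levels by the summarized value.

```r
library(ggplot2)
library(dplyr)

hwy_summary <- mpg |>
  filter(!is.na(hwy)) |>                          # drop missing values before summarizing
  group_by(manufacturer) |>
  summarize(mean_hwy = mean(hwy), .groups = "drop") |>
  arrange(mean_hwy) |>                            # lowest mean first
  mutate(manufacturer = reorder(manufacturer, mean_hwy))  # axis order follows the mean

p_dot <- ggplot(hwy_summary, aes(x = mean_hwy, y = manufacturer)) +
  # The segment from zero to the mean forms the lollipop stick.
  geom_segment(aes(x = 0, xend = mean_hwy, yend = manufacturer), color = "grey60") +
  geom_point(size = 3) +
  labs(
    title = "Mean highway mileage by manufacturer",
    x = "Mean highway miles per gallon",
    y = "Manufacturer"
  )

p_dot
```

Mapping the category to y directly gives horizontal labels without a separate coord_flip() call.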

Part 9: Reflection on idiom choice

Answer the following briefly.

  1. Which of the eight idioms felt most natural to build in ggplot2?
  2. Which idiom required the most data wrangling before plotting?
  3. Give one example from this assignment where changing the position, facet, or coordinate system substantially changed the interpretation of the plot.