Hands-on Exercise 4c: Visualing Uncertainty

Author

Vanessa Heng

Published

January 28, 2024

Modified

March 1, 2024

1 Overview

This exercise aims to gain hands-on experience on creating statistical graphics for visualising uncertainty by:

  • plotting statistics error bars by using ggplot2,

  • plotting interactive error bars by combining ggplot2, plotly and DT,

  • creating advanced by using ggdist, and

  • creating hypothetical outcome plots (HOPs) by using ungeviz package.

2 Getting Started

2.1 Installing and loading the packages

For this exercise, the following R packages will be used, they are:

  • tidyverse, a family of R packages for data science processes,

  • plotly for creating interactive plot

  • gganimate for creating animation plot

  • DT for displaying interactive html table

  • crosstalk for for implementing cross-widget interactions (currently, linked brushing and filtering)

#devtools::install_github("wilkelab/ungeviz")
pacman::p_load(ungeviz, plotly, crosstalk,
               DT, ggdist, ggridges,
               colorspace, gganimate, tidyverse)

2.2 Data import

The following dataset is used for this exercise.

exam_data <- read_csv("data/Exam_data.csv")
exam_data
# A tibble: 322 × 7
   ID         CLASS GENDER RACE    ENGLISH MATHS SCIENCE
   <chr>      <chr> <chr>  <chr>     <dbl> <dbl>   <dbl>
 1 Student321 3I    Male   Malay        21     9      15
 2 Student305 3I    Female Malay        24    22      16
 3 Student289 3H    Male   Chinese      26    16      16
 4 Student227 3F    Male   Chinese      27    77      31
 5 Student318 3I    Male   Malay        27    11      25
 6 Student306 3I    Female Malay        31    16      16
 7 Student313 3I    Male   Chinese      31    21      25
 8 Student316 3I    Male   Malay        31    18      27
 9 Student312 3I    Male   Malay        33    19      15
10 Student297 3H    Male   Indian       34    49      37
# ℹ 312 more rows

3 Visualizing the Uncertainty

3.1 Visualizing the uncertainty of point estimates: ggplot2 methods

A point estimate is a single number, such as a mean. Uncertainty, on the other hand, is expressed as standard error, confidence interval, or credible interval.

Let’s plot error bars of maths scores by race by using data provided in exam_data tibble data frame and then display the information in an HTML table format.

Show the code
my_sum <- exam_data %>%
  group_by(RACE) %>%
  summarise(n = n(),
          mean = mean(MATHS),
          sd = sd(MATHS)) %>%
  mutate(se = sd/sqrt(n-1))
knitr::kable(head(my_sum), format = 'html')
RACE n mean sd se
Chinese 193 76.50777 15.69040 1.132357
Indian 12 60.66667 23.35237 7.041005
Malay 108 57.44444 21.13478 2.043177
Others 9 69.66667 10.72381 3.791438
Note
  • group_by() of dplyr package is used to group the observation by RACE,

  • summarise() is used to compute the count of observations, mean, standard deviation

  • mutate() is used to derive the standard error of Maths by RACE, and the output is saved as a tibble data table called my_sum.

Let’s plot the standard error bars of mean maths score by race as shown below.

Show the code
ggplot(my_sum) +
  geom_errorbar(aes(x = RACE, 
                  ymin = mean - se, 
                  ymax = mean + se), 
              width=0.2, 
              colour="black", 
              alpha=0.9, 
              size=0.5) +
  geom_point(aes(x = RACE, y = mean), 
           stat = "identity", 
           color = "red",
           size = 1.5,
           alpha = 1) +
  ggtitle("Standard error of mean maths scores by race")

Important

For geom_point(), it is important to indicate stat = “identity”.

Instead of plotting the standard error bar of point estimates, we can also plot the confidence intervals of mean maths score by race.

Show the code
ggplot(my_sum) +
  geom_errorbar(
            aes(x=reorder(RACE, -mean), 
                ymin = mean - 1.96*se, 
                ymax = mean + 1.96*se), 
            width=0.2, 
            colour="black", 
            alpha=0.9, 
            size=0.5) +
  geom_point(aes
           (x=RACE, y=mean), 
           stat = "identity", 
           color="red",
           size = 1.5,
           alpha=1) +
  labs(title = "95% confidence interval of mean maths scores by race")

Note
  • The confidence intervals are computed by using the formula mean+/-1.96*se.

  • The error bars is sorted by using the average maths scores.

Let’s plot interactive error bars for the 99% confidence interval of mean maths score by race as shown in the figure below.

Show the code
shared_df = SharedData$new(my_sum)

bscols(widths = c(4,8),
       ggplotly((ggplot(shared_df) +
                   geom_errorbar(aes(
                     x = reorder(RACE, -mean),
                     ymin = mean - 2.58*se, 
                     ymax = mean + 2.58*se), 
                     width = 0.2, 
                     colour = "black", 
                     alpha = 0.9, 
                     size = 0.5) +
           geom_point(aes(x = RACE, y = mean, 
                 text = paste("Race:", `RACE`, 
                              "<br>N:", `n`,
                              "<br>Avg. Scores:", round(mean, 2),
                              "<br>95% CI:[", 
                              round((mean - 2.58*se), 2), ",",
                              round((mean + 2.58*se), 2),"]")),
                 stat="identity", 
                 color="red", 
                 size = 1.5, 
                 alpha=1) + 
            labs(x = "Race", y = "Average Scores",
                 title = "99% Confidence interval of average /<br>maths scores by race")) + 
            theme_minimal() + 
            theme(axis.text.x = element_text(
                  angle = 45, vjust = 0.5, hjust=1)), tooltip = "text"), 
       
       DT::datatable(shared_df, 
                     rownames = FALSE, 
                     class="compact", 
                     width="100%", 
                     options = list(pageLength = 10,
                                    scrollX=T), 
                     colnames = c("No. of pupils", 
                                  "Avg Scores",
                                  "Std Dev",
                                  "Std Error")) %>%
         formatRound(columns=c('mean', 'sd', 'se'),
                     digits=2))

3.2 Visualizing the uncertainty of point estimates: ggdist methods

ggdist is an R package that provides a flexible set of ggplot2 geoms and stats designed especially for visualising distributions and uncertainty.

It is designed for both frequentist and Bayesian uncertainty visualization, taking the view that uncertainty visualization can be unified through the perspective of distribution visualization:

  • for frequentist models, one visualises confidence distributions or bootstrap distributions (see vignette(“freq-uncertainty-vis”));

  • for Bayesian models, one visualises probability distributions (see the tidybayes package, which builds on top of ggdist).

stat_pointinterval() of ggdist is used to build a visual for displaying distribution of maths scores by race.

Show the code
exam_data %>%
  ggplot(aes(x = RACE, y = MATHS)) +
  stat_pointinterval(point_interval = "median_qi",
                show.legend = FALSE) +
  labs( title = "Visualising confidence intervals of mean Maths scores",
    subtitle = "Mean Point + Multiple-interval plot")

stat_gradientinterval() of ggdist is used to build a visual for displaying distribution of maths scores by race.

Show the code
ggplot(exam_data, aes(x = RACE, y = MATHS)) +
  stat_gradientinterval(   
    fill = "skyblue",      
    show.legend = TRUE) +                        
  labs(title = "Visualising confidence intervals of mean Maths scores",
    subtitle = "Gradient + interval plot")

Let’ plot using Hypothetical Outcome Plots (HOPs) where users can visualize a set of draws from a distribution, where each draw is shown as a new plot in either a small multiples or animated form.

Show the code
ggplot(data = exam_data, 
       (aes(x = factor(RACE), y = MATHS))) +
  geom_point(position = position_jitter(
    height = 0.3, width = 0.05), 
    size = 0.4, color = "#0072B2", alpha = 0.5) +
  geom_hpline(data = sampler(25, group = RACE), 
              height = 0.6, color = "#D55E00") +
  theme_bw() + 
  labs(x = "Race") +
  transition_states(.draw, 1, 3)