Interactive Box plot and Jitter with R


A box plot is an excellent chart for quickly visualizing the shape of a distribution and detecting outliers. Nevertheless, a box plot can easily confuse or mislead an audience; one way to overcome this downside is to combine it with a jitter.
In this tutorial, we will show you how to create an effective R visualization by combining a box plot and a jitter.

Remark

  • All the charts are created using the R library Highcharter, so basic knowledge of R is highly recommended to follow along.
  • The dev version of the Highcharter library (0.7.0.9001) is used in this tutorial, as the jitter chart type is not yet integrated into the production version. The production version at the time this article is released is 0.7.0.
  • The data used is the same data set as in the previous JavaScript tutorial Small multiple and box-plot.

 

Let’s get started 🙂

Basically, a box plot summarizes a large number of data points with only five main values: the maximum, the third quartile, the median, the first quartile, and the minimum. If the data set has outliers, a box plot displays them as well.
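As a rough illustration, the five values can be computed with base R's boxplot.stats(). The heights vector below is made up for the example and is not taken from the Olympic data set:

```r
# Hypothetical heights in metres (illustrative values, not from the Olympic data set)
heights <- c(1.60, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.85, 2.10)

# boxplot.stats() returns the five box plot values and any outliers
summary_stats <- boxplot.stats(heights)

# Lower whisker, first quartile, median, third quartile, upper whisker
five_values <- setNames(
  summary_stats$stats,
  c("lower_whisker", "q1", "median", "q3", "upper_whisker")
)
print(five_values)

# Points farther than 1.5 times the box length from the box are reported as outliers
print(summary_stats$out)
```

Note that when outliers are present, boxplot.stats() ends the whiskers at the most extreme values that are not outliers, rather than at the raw minimum and maximum.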
The demo below displays the heights of the 2012 Olympic athletes in four different disciplines: Canoe, Gymnastics, Hockey, and Modern Pentathlon:

Here is the code in R:

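# Install the dev version of Highcharter from GitHub (only needed once)
install.packages("devtools")
library(devtools)
devtools::install_github("jbkunst/highcharter")
library("highcharter")
library("dplyr")
library(readr)

# Check which Highcharter version is installed
packageVersion("highcharter")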

#Load the data
df <-
  read_csv(
    "https://raw.githubusercontent.com/mekhatria/demo_highcharts/master/Olympics2012CapitalLetter.csv"
  )
# Remove the unnecessary columns: nationality, date of birth, name, and age
df <- subset(df, select = -c(nationality, date_of_birth, name, age))
# Keep only the male athletes of the four disciplines to compare
my_data <- df %>% filter((sport == "Gymnastics" &
                   sex == "male")  |
                  (sport == "Canoe" &
                     sex == "male") |
                  (sport == "Hockey" &
                     sex == "male") |
                  (sport == "Modern Pentathlon" & sex == "male")
  )
#Remove the redundant data
my_data = subset(my_data, select = -c(sex))
#Create the chart
hcboxplot(
  outliers = FALSE,
  x = my_data$height,
  var = my_data$sport,
  name = "Length"
) %>%
  hc_title(text = "Male height by discipline (Olympics 2012)") %>%
  hc_yAxis(title = list(text = "Height in metres")) %>%
  hc_chart(type = "column")

From the chart above, we can observe the following:

  • The Gymnastics and Hockey data look roughly symmetric, which is consistent with a normal distribution, since the median is close to the center of the box and the whiskers are about the same length. The whiskers are the thin lines that run from the box to the extremes.
  • The data of the Canoe discipline is slightly skewed toward the top.
  • The data of the Modern Pentathlon discipline is skewed, as its median is close to the bottom of the box.
  • The Modern Pentathlon male heights are closer to each other than those of the other disciplines.
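The visual reading above can be backed by numbers. Here is a minimal sketch, using a made-up data frame (the groups "A" and "B" and the helper box_shape() are hypothetical), that compares the distance from the median to each quartile for every group:

```r
# Hypothetical data: two made-up groups of heights in metres
toy <- data.frame(
  sport  = rep(c("A", "B"), each = 7),
  height = c(1.70, 1.74, 1.78, 1.80, 1.82, 1.86, 1.90,   # roughly symmetric
             1.70, 1.71, 1.72, 1.73, 1.80, 1.88, 1.95)   # median near the bottom of the box
)

# For one group, return the quartiles and the two halves of the box
box_shape <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(q1 = q[1], median = q[2], q3 = q[3],
    lower_half = q[2] - q[1],   # first quartile to median
    upper_half = q[3] - q[2])   # median to third quartile
}

# One row per group, one column per value
shape_by_sport <- t(sapply(split(toy$height, toy$sport), box_shape))
print(round(shape_by_sport, 3))
```

A group whose upper_half is much larger than its lower_half has its median close to the bottom of the box, while similar halves suggest a roughly symmetric box.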

So far, the box plot gives us statistical information and a comparison of the data in a neat, quick visualization. Nevertheless, what a box plot doesn’t show is the data points themselves. Without them, the chart can easily mislead an audience, who might associate the length of a box with the number of data points, or assume that two boxes of similar size contain the same number of data points. In other words, we are not able to answer the following questions:

  1. What is the size of each sample?
  2. How are the samples scattered on the chart?
  3. How many outliers are there in each data set?
  4. How many points are there at each minimum or maximum extreme?
  5. What is the density of the data?
  6. As the Gymnastics and Hockey box plots look alike, does it mean their samples have the same size?
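Some of these questions can also be answered numerically. The sketch below uses a made-up data frame (not the Olympic data set) to count the sample size and the number of outliers per group with base R:

```r
# Hypothetical data: made-up heights in metres for two groups of different sizes
toy <- data.frame(
  sport  = c(rep("A", 12), rep("B", 5)),
  height = c(1.70, 1.72, 1.74, 1.75, 1.76, 1.78, 1.79, 1.80, 1.81, 1.83, 1.85, 2.05,
             1.72, 1.76, 1.78, 1.80, 1.84)
)

# Question 1: sample size per group
sample_sizes <- table(toy$sport)
print(sample_sizes)

# Question 3: number of outliers per group, using the 1.5 times box length rule
outlier_counts <- sapply(
  split(toy$height, toy$sport),
  function(x) length(boxplot.stats(x)$out)
)
print(outlier_counts)
```

Numbers like these complement the chart, but they still do not show how the points are spread, which is what the jitter adds.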

One way to compensate for the disadvantages of a box plot is to add a jitter, which plots the individual data points with a small random offset so that they don’t overlap. A jitter added to a box plot reveals the density and the number of the data points. The chart below displays the same data as the previous chart with a box plot and a jitter:

Here is the code in R:

#Load the data
df <-
  read_csv(
    "https://raw.githubusercontent.com/mekhatria/demo_highcharts/master/Olympics2012CapitalLetter.csv"
  )
# Remove the unnecessary columns: nationality, date of birth, name, and age
df <- subset(df, select = -c(nationality, date_of_birth, name, age))
# Keep only the male athletes of the four disciplines to compare
my_data <- df %>% filter((sport == "Gymnastics" &
                   sex == "male")  |
                  (sport == "Canoe" &
                     sex == "male") |
                  (sport == "Hockey" &
                     sex == "male") | (sport == "Modern Pentathlon" & sex == "male")
  )
#Remove the redundant data
my_data = subset(my_data, select = -c(sex))
#Create the chart
hcboxplot(
  x = my_data$height,
  var = my_data$sport,
  name = "Length",
  color = "#2980b9",
  outliers = TRUE
) %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Male height by discipline (Olympics 2012)") %>%
  hc_yAxis(title = list(text = "Height in metres")) %>%
  hc_add_series(
    data = my_data,
    type = "scatter",
    hcaes(x = "sport", y = "my_data$height", group = "sport")
  ) %>%
  hc_plotOptions(scatter = list(
    color = "red",
    marker = list(
      radius = 2,
      symbol = "circle",
      lineWidth = 1
    )
  ))  %>%
  hc_plotOptions(scatter = list(jitter = list(x = .1, y = 0)))
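To get a feel for what the jitter option in the code above does, here is a conceptual sketch in base R. It illustrates the idea only and is not Highcharts' internal implementation; the positions, heights, and the jitter_x variable are made up, and the offset range is an assumption about how jitter = list(x = .1, y = 0) behaves:

```r
set.seed(42)  # fixed seed so the example is reproducible

# Hypothetical category positions: 0 for the first box, 1 for the second
x_position <- c(rep(0, 5), rep(1, 5))
height     <- c(1.70, 1.74, 1.78, 1.80, 1.82, 1.72, 1.76, 1.78, 1.80, 1.84)

# Assumed jitter amount, mirroring jitter = list(x = .1, y = 0) in the chart code
jitter_x <- 0.1

# Shift each point horizontally by a random offset in [-jitter_x, jitter_x];
# the heights are left untouched, so the y values stay exact
x_jittered <- x_position + runif(length(x_position), min = -jitter_x, max = jitter_x)

print(data.frame(x_position, x_jittered = round(x_jittered, 3), height))
```

Keeping the vertical jitter at 0 matters here: the points spread sideways so they can be told apart, while their heights can still be read accurately against the y-axis.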

Thanks to the jitter data points in this demo, it is clear that the Canoe and Hockey disciplines have the largest number of samples among the disciplines visualized, whereas Modern Pentathlon has the smallest. Although the Gymnastics and Hockey disciplines have almost the same box plot shape, their sample sizes are different. Another advantage of adding a jitter is that it visually exposes how many points sit at the maximum, at the minimum, and among the outliers for each discipline.
Well, it looks like adding the jitter helps us to understand the data set better. However, it is worth mentioning that a jitter chart by itself also has its disadvantages; combining it with the box plot brings out the best of each chart, as they compensate for each other’s flaws. The jitter chart will be explored in another article :).

Now you know how to create a compelling interactive chart by mixing a box plot and a jitter. Let us know about your experience with these charts, and feel free to share any comments or questions in the section below.
