# Interactive Box plot and Jitter with R

A box plot is an excellent chart for quickly visualizing the shape of a data distribution and detecting outliers. However, a box plot on its own can easily confuse or mislead an audience. One way to overcome this downside is to combine the box plot with a jitter.
In this tutorial, we will show you how to create an effective R visualization by combining a box plot and a jitter.

Remarks

• All the charts are created using the R library Highcharter, so basic knowledge of R is highly recommended to follow along.
• This tutorial uses the development version of the Highcharter library (0.7.0.9001), as the jitter chart type is not yet integrated into the production version. This article was released alongside Highcharter version 0.7.0.
• The data used is the same data set as in the previous JavaScript tutorial Small multiple and box-plot.

Let’s get started 🙂

A box plot summarizes a large number of data points with only five main values: the maximum, the third quartile, the median, the first quartile, and the minimum. If the data set has outliers, the box plot displays them as well.
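
To make these five values concrete, here is a minimal sketch in base R. The `heights` vector is made up for illustration and is not taken from the Olympic data set. Note that `fivenum()` returns Tukey's hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
# Hypothetical heights in metres (illustrative values, not from the Olympic data set)
heights <- c(1.62, 1.65, 1.68, 1.70, 1.71, 1.73, 1.75, 1.78, 1.80, 1.95)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(heights)
names(five) <- c("min", "q1", "median", "q3", "max")
print(five)

# boxplot.stats() applies the common 1.5 * IQR rule:
# $stats holds the whisker ends, the hinges, and the median;
# $out holds the points flagged as outliers (here, the 1.95 value)
bp <- boxplot.stats(heights)
print(bp$stats)
print(bp$out)
```

In this sketch the upper whisker stops at the largest value that is not flagged as an outlier, which is why `bp$stats` and `five` can disagree on the extremes.
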
The demo below displays the heights of the male athletes at the 2012 Olympics in four different disciplines: Canoe, Gymnastics, Hockey, and Modern Pentathlon:

Here is the code in R:

```
# Install the development version of Highcharter from GitHub
install.packages("devtools")
library(devtools)
devtools::install_github("jbkunst/highcharter")
library("highcharter")
library("dplyr")

# Check which version of Highcharter is installed
packageVersion("highcharter")

# Load the data
df <- read.csv(
  "https://raw.githubusercontent.com/mekhatria/demo_highcharts/master/Olympics2012CapitalLetter.csv"
)
# Remove the unnecessary data such as nationality, date of birth, name, and age
df <- subset(df, select = -c(nationality, date_of_birth, name, age))
# Keep only the male athletes of the four disciplines
my_data <- df %>% filter(
  (sport == "Gymnastics" & sex == "male") |
    (sport == "Canoe" & sex == "male") |
    (sport == "Hockey" & sex == "male") |
    (sport == "Modern Pentathlon" & sex == "male")
)
# Remove the redundant data
my_data <- subset(my_data, select = -c(sex))
# Create the chart
hcboxplot(
  outliers = FALSE,
  x = my_data$height,
  var = my_data$sport,
  name = "Length"
) %>%
  hc_title(text = "Male height by discipline (Olympic 2012)") %>%
  hc_yAxis(title = list(text = "Height in metres")) %>%
  hc_chart(type = "column")
```

From the chart above, we can observe the following:

• The Gymnastics and Hockey data look roughly symmetric, which is consistent with a normal distribution: the median is close to the center of the box, and the whiskers are about the same length. The whiskers are the thin lines that extend from the box to the extremes.
• The data of the Canoe discipline is slightly skewed toward the top.
• The data of the Modern Pentathlon discipline is skewed, as its median is close to the bottom of the box.
• The Modern Pentathlon male heights are closer to each other than those of the other disciplines.
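
The visual reading above (median position and whisker lengths) can also be approximated with numbers. The following is a rough sketch in base R on a made-up `heights` vector. The two indicators, `median_position` and `whisker_ratio`, are hypothetical helper names introduced here for illustration and are not part of Highcharter.

```r
# Hypothetical heights in metres for one discipline (made-up values)
heights <- c(1.70, 1.72, 1.73, 1.74, 1.75, 1.80, 1.86, 1.90, 1.94)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(heights)$stats

# Relative position of the median inside the box:
# 0.5 means centered, values near 0 mean the median sits close to the bottom
median_position <- (s[3] - s[2]) / (s[4] - s[2])

# Ratio of the upper whisker length to the lower whisker length:
# values near 1 mean the whiskers are about the same length
whisker_ratio <- (s[5] - s[4]) / (s[2] - s[1])

print(median_position)
print(whisker_ratio)
```

A median position well below 0.5 together with a whisker ratio above 1 suggests a distribution skewed toward the higher values. The ratios are undefined when the box or the lower whisker has zero length, so this check is only a complement to looking at the chart.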

So far, the box plot gives us some statistical information and a data comparison in a neat and quick visualization. However, a box plot does not show the individual data points. Without them, the chart could easily mislead an audience, who might associate the length of a box with the number of data points, or assume that two boxes of similar size contain the same number of data points. In other words, we are not able to answer the following questions:

1. What is the size of each sample?
2. How are the samples scattered on the chart?
3. How many outliers are there in each data set?
4. How many points are there at each minimum or maximum extremity?
5. What is the density of the data?
6. As the Gymnastics and Hockey box plots have similar shapes, does it mean they have the same sample size?
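
Some of these questions, such as the first and the third, can also be answered numerically. Below is a minimal sketch in base R. The `toy` data frame is hypothetical: it reuses the `sport` and `height` column names from `my_data`, but its values are made up for illustration.

```r
# Hypothetical data frame with the same column names as my_data (made-up values)
toy <- data.frame(
  sport = c(rep("Gymnastics", 6), rep("Hockey", 12)),
  height = c(
    1.60, 1.63, 1.65, 1.66, 1.68, 1.85,
    1.70, 1.72, 1.74, 1.75, 1.76, 1.77, 1.78, 1.79, 1.80, 1.81, 1.82, 1.84
  )
)

# Question 1: sample size per discipline
sizes <- table(toy$sport)
print(sizes)

# Question 3: number of outliers per discipline under the 1.5 * IQR rule
outliers <- tapply(
  toy$height,
  toy$sport,
  function(h) length(boxplot.stats(h)$out)
)
print(outliers)
```

Numbers like these answer the counting questions, but they still say little about how the points are scattered or how dense they are.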

One way to compensate for the disadvantages of a box plot is to add a jitter. A jitter added to a box plot displays the individual data points, which reveals their density and the size of each sample. The chart below displays the same data as the previous chart with a box plot and a jitter:

Here is the code in R:

```
# Load the data
df <- read.csv(
  "https://raw.githubusercontent.com/mekhatria/demo_highcharts/master/Olympics2012CapitalLetter.csv"
)
#Remove the unnecessary data such as nationality, date of birth, name, and age
df = subset(df, select = -c(nationality, date_of_birth, name, age))
# Keep only the male athletes of the four disciplines
my_data <- df %>% filter(
  (sport == "Gymnastics" & sex == "male") |
    (sport == "Canoe" & sex == "male") |
    (sport == "Hockey" & sex == "male") |
    (sport == "Modern Pentathlon" & sex == "male")
)
#Remove the redundant data
my_data = subset(my_data, select = -c(sex))
#Create the chart
hcboxplot(
  x = my_data$height,
  var = my_data$sport,
  name = "Length",
  color = "#2980b9",
  outliers = TRUE
) %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Male height by discipline (Olympic 2012)") %>%
  hc_yAxis(title = list(text = "Height in metres")) %>%
  # Add the individual data points as a scatter series on top of the box plot
  hc_add_series(
    data = my_data,
    type = "scatter",
    hcaes(x = sport, y = height, group = sport)
  ) %>%
  hc_plotOptions(scatter = list(
    color = "red",
    marker = list(