Histograms: When, Why & How (Part 1)


The goal of data analysis is to turn data into actionable information and insight. To do that, you have to first understand what your dataset contains, and what it can tell you (and just as importantly, what it can’t tell you).

Seeing the distribution of your dataset is one of the key ways to begin understanding the data and how you can work with it. Histograms are a great way to plot distributions and quickly view the shape of your data. The shape of your data tells you a lot. Is it symmetrical? Right-skewed? Left-skewed? Zero-bounded? Bimodal? Multi-modal?
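To make the idea of "shape" concrete, here is a minimal Python sketch of what a histogram does under the hood: it splits the value range into equal-width bins and counts the values that fall into each one. The function name `bin_counts`, the choice of four bins, and the sample values are all illustrative assumptions, not taken from any particular charting library or data set.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up sample values, for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = bin_counts(sample, 4)

# Print a crude text histogram, one row per bin.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

Even in this tiny example, most of the values pile up in the first two bins with a thin tail to the right, which is the kind of feature a histogram makes visible at a glance.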

Without digging into rigorous statistical analyses, you can quickly build a foundational understanding of your dataset by viewing these basic features, whether you’re measuring process variation, survey responses, demographic data, or weather patterns.

LET’S LOOK AT A COUPLE OF EXAMPLES.

To start, we’ll take a look at data from the well-known Fisher’s Iris data set. Briefly, it is a collection of measurements of four attributes (sepal length, sepal width, petal length, and petal width) for three distinct species of Iris flower.

Plotting the distribution of the Sepal Width data, we see a fairly symmetrical shape: the data points fall relatively evenly on both sides of a central point (with a very slight skew to the right, or positive). This suggests that sepal width does not differ dramatically between the species measured, so a first pass at this data may not need to be separated by species. A single peak cannot rule out smaller, overlapping differences between groups, though, so treat this as a starting point rather than a conclusion.

In this Histogram, we plot the Petal Width data from the same Iris data set. Unlike the previous data, this data is very distinctly multi-modal – there are multiple peaks, some drastically different, which tells us very clearly that there are multiple factors affecting the data. If we did not already know that this data set measured three different species of Iris, this Histogram would immediately tell us to take a look at what is causing those three different groups. If we’re going to analyze this data, we’ll need to facet the analysis by species!
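As a rough illustration of faceting, the Python sketch below bins a pooled sample and then bins each group separately on the same axis. The data is synthetic: the group names, means, and spreads are invented for the example and are not the Iris measurements. The helper names `facet` and `shared_axis_counts` are likewise assumptions made for this sketch.

```python
import random
from collections import defaultdict


def facet(records):
    """Group (label, value) pairs into a dict of label -> list of values."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return dict(groups)


def shared_axis_counts(values, lo, hi, n_bins):
    """Bin values over the fixed range [lo, hi] so every facet shares one axis."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so values at the edges of the range land in the first/last bin.
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return counts


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data: two invented groups with different centers.
records = [("group_a", rng.gauss(0.3, 0.1)) for _ in range(50)]
records += [("group_b", rng.gauss(1.8, 0.3)) for _ in range(50)]

all_values = [v for _, v in records]
lo, hi = min(all_values), max(all_values)

pooled = shared_axis_counts(all_values, lo, hi, 8)
print("pooled :", pooled)

by_group = facet(records)
for label, values in by_group.items():
    print(f"{label}:", shared_axis_counts(values, lo, hi, 8))
```

Binning every facet over the same range is a deliberate choice here: it keeps the per-group rows directly comparable with the pooled row, so it is easy to see which group contributes which peak.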


Moving on to another data set, this one measures the weights of chicks that have been fed various diets. We see a number of attributes displayed in this Histogram:

  • The distribution is right-skewed, or skewed to the positive – the majority of values are clustered on the left, with a tail of progressively fewer values stretching further along the x-axis.
  • The distribution is multi-modal – there is more than one peak present.
  • The data is strongly affected by being zero-bounded – the distribution can’t go any further left, or negative, since a chick cannot weigh a negative amount.

Among other things, this distribution tells us that we must facet our analysis by the diet of the chick, or find any other factors that might be causing multiple peaks.
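The visual impression of right skew can be backed up with a quick number. The Python sketch below computes a simple skewness measure (the mean of the cubed z-scores) and compares the mean with the median. The weights are made-up values chosen to have a long right tail; they are not the chick data, and the function name `skewness` is an assumption for this example.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Population skewness: the mean of cubed z-scores.

    Positive values indicate a longer right tail, negative a longer left tail.
    """
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Made-up weights with a long right tail, for illustration only.
weights = [40, 42, 45, 48, 50, 55, 60, 75, 110, 180]

print(f"skewness = {skewness(weights):.2f}")
print(f"mean = {mean(weights):.1f}, median = {median(weights):.1f}")
print(f"minimum = {min(weights)} (a weight cannot fall below zero)")
```

For these values the mean (70.5) sits well above the median (52.5), because the few large values pull the mean to the right. That gap is a handy numeric companion to the skew you see in the histogram.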


WHY IT’S IMPORTANT TO KNOW THESE THINGS ABOUT OUR DATA

We can see in two of the examples above that when our data has a multi-modal distribution, it can be a clear indication that there are multiple factors in the data set that need to be addressed.

If we don’t address these different factors in our data analysis, we are clearly going to miss something, and will likely present inaccurate conclusions about the data.

One of the more important things to understand about your data is whether or not it is normally distributed. A lot of statistical methods are built on the assumption of normally distributed data. If your data is not normally distributed, simple things like the mean and standard deviation may not be helpful indicators – we must use other methods, or transform our data (topics for another day).

That said, a Histogram alone is not enough to conclude that your data is normally distributed; you should also run a statistical test for normality, such as the Anderson–Darling test. But a Histogram is a good, fast, simple, visual way to get the ball rolling. Present the output of an Anderson–Darling test by itself, and statisticians will understand you; present it along with a Histogram, and your audience immediately becomes much wider.
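For readers curious what an Anderson–Darling check involves, here is a minimal Python sketch using only the standard library. It assumes the mean and standard deviation are estimated from the sample, applies a commonly cited small-sample adjustment, and compares the result with a commonly cited 5% critical value of 0.752. The two samples are idealized and synthetic, built from evenly spaced quantiles rather than real measurements. The function name `anderson_darling_normal` is an assumption for this example.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """Adjusted Anderson-Darling statistic for normality.

    Assumes the mean and standard deviation are estimated from the sample.
    """
    n = len(values)
    m, s = mean(values), stdev(values)
    std_normal = NormalDist()
    # Standard normal CDF at each sorted, standardized value. The clamp
    # guards against log(0) for extreme outliers.
    cdf = [
        min(max(std_normal.cdf((v - m) / s), 1e-12), 1 - 1e-12)
        for v in sorted(values)
    ]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    # Small-sample adjustment used when both parameters are estimated.
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


n = 100
# Idealized synthetic samples built from evenly spaced quantiles.
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
right_skewed = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

CRITICAL_5_PERCENT = 0.752  # commonly cited 5% critical value (assumption)

for name, sample in [("normal-like", normal_like), ("right-skewed", right_skewed)]:
    a2 = anderson_darling_normal(sample)
    verdict = "reject normality" if a2 > CRITICAL_5_PERCENT else "no evidence against normality"
    print(f"{name}: adjusted A^2 = {a2:.3f} -> {verdict}")
```

In practice you would more likely reach for a library routine such as `scipy.stats.anderson` than hand-roll the statistic. The sketch is only meant to show that the test boils down to comparing sorted data against the normal curve, with extra weight on the tails.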

WHAT NEXT?

So far, we’ve talked about what a Histogram can do for you and why you might want to use one, and we’ve looked at some examples of Histograms that gave us insight into the data they represent. In Part 2, we’ll move on to when and how to implement Histograms, using Highcharts.