Histograms: When, Why & How (Part 2)

In part 1 of the Histograms post, we talked about what a Histogram can do for you, why you might want to use them, and looked at some examples of Histograms that gave us some insight into the data that they represented.
In this post, we’ll move on to when and how to implement Histograms, using Highcharts.

WHEN TO USE A HISTOGRAM

Histograms are useful when exploring any set of continuous numeric data, with enough data points to make useful inferences.

How many data points are enough? That’s a topic of debate, with many suggestions ranging between 50 and 100 data points as a minimum size.

The bottom line is, if you have too few data points, a Histogram isn’t going to tell you anything useful. There is no danger in plotting a Histogram with too few data points – unless you try to base statistical conclusions on the result!

BUILDING A HISTOGRAM IN HIGHCHARTS

One question that has been asked many times is how to go about creating a Histogram with Highcharts. Highcharts is a great tool for displaying Histograms, with a wide variety of options available to format the chart as needed.

FORMATTING

On the formatting side, a Histogram is just a column chart. The primary difference is that gaps between columns are removed in a Histogram. This is done for one simple reason: each part of the x axis is a bin, and each bin covers part of a continuous range along the axis.
If there is a range on the x axis with no column, that means that no data from the dataset fell into that range.
Removing gaps between the bars makes it clear that each bin starts where the previous bin ended, and if a gap exists, it means something about your data.
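
As a minimal sketch of the formatting described above, the options below remove the gaps between columns. The helper name histogramOptions, the titles, and the sample data are assumptions for illustration only; pointPadding, groupPadding and borderWidth are the column options that control the spacing.

```javascript
// Hypothetical helper: builds a Highcharts options object for a Histogram.
// binnedData is assumed to be an array of [x, y] pairs, one per bin.
function histogramOptions(binnedData) {
  return {
    chart: { type: 'column' },
    title: { text: 'Histogram' },       // placeholder title
    legend: { enabled: false },
    plotOptions: {
      column: {
        pointPadding: 0, // no padding around each column
        groupPadding: 0, // no padding between x axis positions
        borderWidth: 1   // thin border keeps adjacent bins distinguishable
      }
    },
    xAxis: { title: { text: 'Value' } }, // placeholder axis titles
    yAxis: { title: { text: 'Count' } },
    series: [{ name: 'Count', data: binnedData }]
  };
}

// With Highcharts loaded and an element with id "container" on the page:
// Highcharts.chart('container', histogramOptions([[1, 4], [2, 9], [3, 5]]));
```

Keeping the options in a plain function means the same formatting can be reused for any binned dataset.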

DATA PROCESSING

But what about the data?

Users often look to the charting library to do the work: they have a dataset, and want the chart to process and bin the data into a Histogram.

But Highcharts is not a statistical analysis tool; it’s a charting tool! Sometimes the line between those things is blurry, but honestly I would rather a library stick to what it does best, and do it well, than try to do everything.

Fortunately, processing data to build a Histogram is not very difficult, and we can build a JavaScript function to do this for us.
Included here is a function that I use when I need to process the data on the client side.

When possible, I process the data before it ever gets to the client, using PHP, Python, R, or whatever is available in the setup at hand. But sometimes, you need to do it client side, and as long as the dataset is not incredibly large, this is not a problem. If you are new to Histograms, follow through the code and explanation below to understand the mechanics of this kind of data preparation.
The function:

function binData(data) {

  var hData = [], //the output array
      size = data.length, //how many data points
      bins = Math.round(Math.sqrt(size)); //determine how many bins we need
  bins = bins > 50 ? 50 : bins; //adjust if more than 50 bins
  var max = Math.max.apply(null, data), //highest data value
      min = Math.min.apply(null, data), //lowest data value
      range = max - min, //total range of the data
      width = range / bins, //size of the bins
      bin_bottom, //placeholders for the bounds of each bin
      bin_top,
      count; //how many data points fall in the current bin

  //loop through the bins
  for (var i = 0; i < bins; i++) {

    //set the lower and upper limits of the current bin
    bin_bottom = min + (i * width);
    //the last bin ends exactly at max, so rounding error cannot exclude it
    bin_top = (i === bins - 1) ? max : bin_bottom + width;
    count = 0;

    //loop through the data to see if it fits in this bin
    for (var j = 0; j < size; j++) {
      var x = data[j];

      //each bin holds values above its bottom, up to and including its top;
      //the first bin also includes the minimum value itself
      if ((x > bin_bottom || (i === 0 && x === min)) && x <= bin_top) {
        count++;
      }
    }

    //store the bin as [x, y]: x is the center of the bin,
    //y is the count, or null if the bin is empty
    hData[i] = [bin_bottom + (width / 2), count > 0 ? count : null];
  }
  return hData;
}

And a Fiddle to experiment with: http://jsfiddle.net/jlbriggs/gud4bp66/

FIRST, WHAT THIS FUNCTION DOES NOT DO:

It is not a fully robust, error-proof function. It’s a quick-and-dirty example that can serve as a useful tool in conjunction with adequate safeguards around the data being sent to it, or that can serve as a foundation to build a more robust function or class.

WHAT THE FUNCTION DOES DO:

First, the function determines the size of the data (how many data points we have) and the number of bins to create.
Histograms are all about the bins.
While determining the number of bins is a topic with a wide variety of opinions, a good rule of thumb is to use the square root of the number of data points.

The function puts a hard stop at 50 bins, however, as beyond a certain number of bins, a Histogram can often be less useful. 50 is an arbitrary number, based on observational experience – set it to any number that makes sense for your data, or remove the limit altogether if you wish.
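
The square root rule and the cap can be isolated into a small helper, sketched here. The name binCount, the maxBins parameter, and the floor of one bin are assumptions added for illustration; they are not part of the function above.

```javascript
// Hypothetical helper: number of bins for a dataset of `size` points,
// using the square root rule of thumb with an optional upper limit.
function binCount(size, maxBins) {
  var bins = Math.round(Math.sqrt(size));
  if (bins < 1) {
    bins = 1; // assumption: always create at least one bin
  }
  // only apply the cap when one was supplied
  return (maxBins && bins > maxBins) ? maxBins : bins;
}

console.log(binCount(100, 50));   // 10
console.log(binCount(33000, 50)); // 50 (the square root is about 182, so the cap applies)
console.log(binCount(33000));     // 182 (no cap supplied)
```

Passing the limit as a parameter makes it easy to try different caps against the same data.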

Next the function determines the range of the data set, and uses the range and the number of bins to determine the width that each bin needs to be.
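
To make the arithmetic concrete, here is a small sketch that computes the range, the bin width, and the resulting bin boundaries. The helper name binEdges and the sample values are made up for illustration.

```javascript
// Hypothetical helper: range, bin width and bin boundaries for a dataset.
function binEdges(data, bins) {
  var min = Math.min.apply(null, data),
      max = Math.max.apply(null, data),
      range = max - min,    // total range of the data
      width = range / bins, // every bin gets the same width
      edges = [];

  // there is always one more boundary than there are bins
  for (var i = 0; i <= bins; i++) {
    edges.push(min + (i * width));
  }
  return { min: min, max: max, range: range, width: width, edges: edges };
}

var result = binEdges([2, 4, 6, 10, 18], 4);
console.log(result.range); // 16
console.log(result.width); // 4
console.log(result.edges); // [ 2, 6, 10, 14, 18 ]
```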

All bins are the same width. You will find people who advocate for, or wish for, bins to be different sizes. I strongly caution against this, because it 1) causes unnecessary complexity in the chart, which forces the user to work harder to decode the information, and 2) is not usually based on anything statistically valid. Further arguments for or against this assertion are beyond the scope of this post!

Once we have these variables, we can loop through the bins, loop through the data, and pull each data point into the bin where it belongs.
The output returned is a ready-to-use array of x,y pairs, where x is the center of a bin and y is the number of data points that fell into it (or null for an empty bin).
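
For larger datasets, looping through every data point once per bin can get slow. The sketch below is an alternative, not the function from this post: it makes a single pass over the data and calculates each point's bin directly. The name binDataSinglePass and the bins parameter are assumptions. Note one difference in behavior: a value that sits exactly on a boundary between two bins goes into the upper bin here, while the function above puts it into the lower one.

```javascript
// Alternative sketch: bin the data in one pass by computing each bin index.
function binDataSinglePass(data, bins) {
  var min = Math.min.apply(null, data),
      max = Math.max.apply(null, data),
      width = (max - min) / bins,
      counts = [],
      result = [],
      i, index;

  // start every bin at zero
  for (i = 0; i < bins; i++) {
    counts.push(0);
  }

  for (i = 0; i < data.length; i++) {
    // width is 0 when all values are equal; use the first bin in that case
    index = width > 0 ? Math.floor((data[i] - min) / width) : 0;
    if (index >= bins) {
      index = bins - 1; // the maximum value belongs to the last bin
    }
    counts[index]++;
  }

  // build [x, y] pairs: center of the bin, and the count (null if empty)
  for (i = 0; i < bins; i++) {
    result.push([min + ((i + 0.5) * width), counts[i] > 0 ? counts[i] : null]);
  }
  return result;
}

console.log(binDataSinglePass([1, 2, 2, 4.5, 9], 4));
// [ [ 2, 3 ], [ 4, 1 ], [ 6, null ], [ 8, 1 ] ]
```

The work grows with the number of data points only, rather than with the number of data points multiplied by the number of bins.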

A FINAL EXAMPLE

This is a Histogram of the weights of cartons shipped from a distribution center, a dataset of a little more than 33,000 data points.

Once again, we notice several important features of this data immediately:

  • It is definitely not normally distributed
  • It is zero-bounded on the left
  • It is right-skewed
  • It is multi-modal

In particular, the spike at the 40 lb mark jumps out immediately. Aside from that point, we have a lot of values clustered in a fairly low range, 2-10 lbs, and a steady decrease up into the higher weights, as might be expected for the type of product being sold.
But why so many at 40 lbs? Do they have a lot of products that weigh 40 lbs? The answer was no.
What they did have was a limit on how much weight their software would tell the workers to put into a single carton. So when a larger order needed more than one carton, the first carton would often reach, or come near, the 40 lb mark, and a second carton would be used for what was left.
In that case, why are there any cartons over 40 lbs? Because some of the items weigh more than 40 lbs as a single unit.
This is a fairly simple sequence of understanding the data in this dataset, but what it highlights is very important: a Histogram might not give you the answers that you need, but it will give you the questions.

And there is nothing more important when analysing data than asking the right questions!

SO HOW ABOUT IT?

Are you a reader who has worked with Histograms already? Is the idea new to you?
If you’re new to Histograms, why not try it out? Leave a comment, and let us know how it worked out – post a link, tell a story, ask a question.
If you work with Histograms a lot, let us know what you love about them. Leave a comment with a story of insights you’ve gained through a Histogram that you might not have otherwise, or your own tips on using them.