Visualize Wikipedia Data with NodeJS and Highcharts


Wikipedia is a great source of information and data, with a rate of over 10 edits per second. The English Wikipedia alone gets 600 new articles per day. Wikipedia also offers many tools for exploring page statistics, such as Pageviews Analysis, Wikipedia Ranking, and the Wikipedia API. If you are a DataViz enthusiast like me, this is a treasure trove of data!

In this tutorial, I will show you how to extract and visualize the Pageviews Analysis data using the Wikipedia API, NodeJS, and Highcharts.

The good news is that MediaWiki provides an easy and straightforward Wikipedia API, with no need for an API key.

Let’s get started!

I will extract the dates and the number of user views of the Wikipedia page International Space Station from July 1, 2017 to June 3, 2018, then plot the trend in an interactive chart (see GIF below):

Remark

You can download the code used in this article from the following GitHub link.

 

I use the following Wikipedia API structure: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/International_Space_Station/daily/2017070100/2018060300. Counting the host wikimedia.org as the first field, the name of the page is the 10th field, and the start and end dates are the 12th and 13th fields, each written as YYYYMMDDHH. For more details about the Wikipedia API, click here.
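To make the role of each URL field explicit, here is a minimal sketch that assembles the same address from its parts. The helper name buildPageviewsUrl and its parameters are my own illustration, not part of the API or of the original code; the project, access, agent, and granularity values are simply the ones used in this article.

```javascript
// Hypothetical helper (not part of the original code): builds the
// per-article pageviews URL used in this article from its parts.
function buildPageviewsUrl(article, start, end) {
  const base = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article';
  const segments = [
    'en.wikipedia', // project
    'all-access',   // access type
    'user',         // agent type
    encodeURIComponent(article.replace(/ /g, '_')), // page title with underscores
    'daily',        // granularity
    start,          // start date, YYYYMMDDHH
    end,            // end date, YYYYMMDDHH
  ];
  return base + '/' + segments.join('/');
}

const url = buildPageviewsUrl('International Space Station', '2017070100', '2018060300');
console.log(url);
```

Changing the article title or the two dates is then enough to query another page or another period.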

To handle the API call, I use the request-promise package.

First, let’s create a folder to save the code. Browse to the folder you created and install the request-promise package, together with the request package it depends on:

npm install --save request
npm install --save request-promise

As I am using the highcharts library, I need to install it as well with this command line:
npm install highcharts

The last package to install is browserify.
npm install browserify

Browserify allows me to bundle the whole code (including the Highcharts library) into a single js file that I can include as a script in the HTML webpage.

I will first display the code (you may copy and paste it) and run it; then I will walk you through it.

The code

Create a new js file (e.g., code.js), and copy/paste the code below:

var rp = require('request-promise');
var Highcharts = require('highcharts');

var options = {
  method: 'GET',
  uri: 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/International_Space_Station/daily/2017070100/2018060300',
  json: true,
};

rp(options)
  .then((parseBody) => {
    var arrData = [];
    var year, month, day;

    for (var i = 0; i < parseBody.items.length; i++) {
      year = parseBody.items[i].timestamp.slice(0, 4);
      month = parseBody.items[i].timestamp.slice(4, 6);
      day = parseBody.items[i].timestamp.slice(6, 8);
      arrData.push([new Date(year + '-' + month + '-' + day).toDateString(), parseBody.items[i].views]);
    }

    year = parseBody.items[0].timestamp.slice(0, 4);
    month = parseBody.items[0].timestamp.slice(4, 6);
    day = parseBody.items[0].timestamp.slice(6, 8);

    // Create the chart    
    Highcharts.chart('container', {
      title: {
        text: 'Views of the International Space Station Wikipedia webpage'
      },
      subtitle: {
        useHTML: true,
        text: 'Source: Wikipedia'
      },
      xAxis: {
        type: 'datetime',
        dateTimeLabelFormats: {
          day: '%y/%b/%e'
        }
      },
      yAxis: {
        title: {
          text: 'Number of views'
        }
      },
      series: [{
        name: 'views',
        data: arrData,
        pointStart: Date.UTC(year, month - 1, day), // Date.UTC months are zero-based
        pointInterval: 24 * 3600 * 1000 // one day
      }]
    });
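  })
  .catch((err) => {
    // Report network or parsing errors instead of failing silently
    console.error('Could not load the pageviews:', err.message);
  });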

Don’t forget to also create an HTML file (e.g., chart.html), then copy/paste the code below:

<html>	
    <head>
        <script src="bundle.js"></script>	 
    </head>	
    <body>	
        <div id="container"></div>       
    </body>	
</html>

Run the code

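To run the code, execute this command in the terminal: browserify code.js > bundle.js (if browserify is installed locally rather than globally, run it as npx browserify code.js > bundle.js). Then open the HTML file in a browser to see the result.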

Explanations

I create the options object, which holds all the information needed to make the request. This route does not require any authentication, so the object stays pretty simple.

var options = {
  method: 'GET',
  uri: 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/International_Space_Station/daily/2017070100/2018060300',
  json: true,
};

The object includes:

  • The method/type of the request (GET, POST, PUT, DELETE). In this case, I use GET, as I request data from Wikipedia.
  • The URL to call, given by uri.
  • The expected data type of the response, in this case JSON. Setting json to true makes the library parse the response body automatically.

The following code launches the data-fetching process:

rp(options)
  .then((parseBody) => {
….
});

parseBody holds the data fetched from Wikipedia:

...{"project":"en.wikipedia","article":"International_Space_Station","granularity":"daily","timestamp":"2018021700","access":"all-access","agent":"user","views":4549},{"project":"en.wikipedia","article":"International_Space_Station","granularity":"daily","timestamp":"2018021800","access":"all-access","agent":"user","views":4896},{"project":"en.wikipedia","article":"International_Space_Station","granularity":"daily","timestamp":"2018021900","access":"all-access","agent":"user","views":4634},{"project":"en.wikipedia","article":"International_Space_Station","granularity":"daily","timestamp":"2018022000","access":"all-access","agent":"user","views":4701} ...,
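As an aside, the request package has since been deprecated, so here is a hedged sketch of the same call made with the built-in fetch function instead. It assumes an environment that provides a global fetch (modern browsers, or Node.js 18 and later). The names extractItems and fetchPageviews are hypothetical helpers of mine, not part of the original code or of any library.

```javascript
// Hypothetical helper: checks that the parsed response has the "items"
// array that the rest of the code relies on, and returns it.
function extractItems(body) {
  if (!body || !Array.isArray(body.items)) {
    throw new Error('Unexpected response: no "items" array');
  }
  return body.items;
}

// Hypothetical alternative to rp(options), assuming a global fetch exists.
function fetchPageviews(url) {
  return fetch(url)
    .then((response) => {
      if (!response.ok) {
        // fetch does not reject on HTTP errors, so check the status explicitly
        throw new Error('Request failed with status ' + response.status);
      }
      return response.json(); // parse the body as JSON
    })
    .then(extractItems);
}

// Example usage (performs a real network request, so it is left commented out):
// fetchPageviews('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/International_Space_Station/daily/2017070100/2018060300')
//   .then((items) => console.log(items.length + ' days of data'))
//   .catch((err) => console.error(err.message));
```

The explicit status check and the shape check are there so that a failed request surfaces as a readable error rather than as an exception deep inside the chart code.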

parseBody contains a lot of information, but I am only interested in the number of views and the dates. To extract this data, I use the following loop:

for (var i = 0; i < parseBody.items.length; i++) {
      year = parseBody.items[i].timestamp.slice(0, 4);
      month = parseBody.items[i].timestamp.slice(4, 6);
      day = parseBody.items[i].timestamp.slice(6, 8);

      arrData.push([new Date(year + '-' + month + '-' + day).toDateString(), parseBody.items[i].views]);
    }

Notice that I use three variables to handle the dates: year, month, and day. This is because the timestamps in the Wikipedia API are structured as YYYYMMDDHH strings (for example, 2018021700). I would have preferred a Unix timestamp, as it is much easier to manage. Oh, well…
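If you would rather work with Unix-style timestamps, the conversion can be sketched as below. The function names parseWikiTimestamp and toSeriesData are hypothetical, and the sample items are copied from the response excerpt above. The one detail to watch is that Date.UTC expects a zero-based month, so 1 is subtracted from the month read from the string.

```javascript
// Hypothetical helper: converts a "YYYYMMDDHH" string from the API
// into milliseconds since the Unix epoch (UTC midnight of that day).
function parseWikiTimestamp(timestamp) {
  const year = Number(timestamp.slice(0, 4));
  const month = Number(timestamp.slice(4, 6));
  const day = Number(timestamp.slice(6, 8));
  return Date.UTC(year, month - 1, day); // months are zero-based in Date.UTC
}

// Hypothetical helper: turns the API items into [timestamp, views] pairs.
function toSeriesData(items) {
  return items.map((item) => [parseWikiTimestamp(item.timestamp), item.views]);
}

// Two items taken from the response excerpt shown earlier
const sample = [
  { timestamp: '2018021700', views: 4549 },
  { timestamp: '2018021800', views: 4896 },
];
console.log(toSeriesData(sample));
```

Each pair holds a numeric x value in milliseconds, which is the form a Highcharts datetime axis works with.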

Once all those data are extracted, I build the chart using Highcharts:

Highcharts.chart('container', {
      title: {
        text: 'Views of the International Space Station Wikipedia webpage'
      },
      subtitle: {
        useHTML: true,
        text: 'Source: Wikipedia'
      },
      xAxis: {
        type: 'datetime',
        dateTimeLabelFormats: {
          day: '%y/%b/%e'
        }
      },
      yAxis: {
        title: {
          text: 'Number of views'
        }
      },
      series: [{
        name: 'views',
        data: arrData,
        pointStart: Date.UTC(year, month - 1, day), // Date.UTC months are zero-based
        pointInterval: 24 * 3600 * 1000 // one day
      }]
    });
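A variation worth considering: when every point carries its own numeric timestamp as [x, y], the series no longer needs pointStart and pointInterval, because Highcharts reads the x position from the data itself. The sketch below only builds the options object, so it can be inspected without a browser; buildChartOptions is a hypothetical name, and the two data points are taken from the response excerpt above.

```javascript
// Hypothetical helper: builds the chart options from [timestamp, views] pairs.
function buildChartOptions(seriesData) {
  return {
    title: { text: 'Views of the International Space Station Wikipedia webpage' },
    subtitle: { text: 'Source: Wikipedia' },
    xAxis: {
      type: 'datetime',
      dateTimeLabelFormats: { day: '%y/%b/%e' },
    },
    yAxis: { title: { text: 'Number of views' } },
    series: [{
      name: 'views',
      data: seriesData, // x values come from the pairs, so no pointStart is needed
    }],
  };
}

const chartOptions = buildChartOptions([
  [Date.UTC(2018, 1, 17), 4549],
  [Date.UTC(2018, 1, 18), 4896],
]);
console.log(chartOptions.series[0].data.length);

// In the browser bundle, the chart would then be created with:
// Highcharts.chart('container', chartOptions);
```

Keeping the options in a plain function also makes it easy to reuse the same chart layout for another page or date range.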

So that’s how you may visualize Wikipedia Pageviews Analysis, using NodeJS and Highcharts. I really enjoyed setting up this project, as the Wikipedia API is easy to use. I have barely scratched the surface, and I encourage you to play around with the code and API to visualize other data and trends in this amazing collection of data.