Get your data ready for charts with Python

Arrow containing different data figures


Python is my data analysis language of choice. After using ParseHub to collect huge databases of JSON from online sources, I then use Python’s great libraries like pandas and matplotlib to get the data ready to visualize. In this tutorial, I’ll give you a few examples of how you can use Python’s datetime library to convert extracted dates and times into a format that is understood by your computer. With this tool you will be able to make visuals that change over the past, present, and future.

This example will use data from the table found on this Wikipedia page, which includes the date that each one of the current members joined the United Nations. I will show you how to get it ready for visualizing by converting all of the strings in the date columns into datetimes.

I scraped the table using ParseHub, which allows me to download the extracted data as JSON.

The next step is to import the libraries you will need into Jupyter notebook.

import math
import pandas as pd
import pylab as pl
import numpy as np
import json
import datetime

Python’s pandas library has a function read_json to import JSON into a pandas data structure. I used it to first import the data oriented as one column:

data = pd.read_json(‘UN_members.json', orient='columns')

Next, each cell will be read as an index:

data = pd.read_json(data['states'].to_json(), orient='index')

For this example, I would like to calculate how long each country has been a member of the UN, so I add a blank column called “time” to my DataFrame.

time = pd.Series(None, index = data.index, name = 'time') data = data.join(time)

You can’t calculate anything with the dates found in the “admission” column just yet, because they are still being stored as strings. Luckily, pandas has a function to_datetime that will be able to convert them to a computer readable datetime variable, as long as you tell it what the format of each date is.

In this case, the format is the day, followed by a space, then the name of the month, then another space, then theyear. Using this reference, I know that the format is written: “%d %B %Y”

I’ll first find today’s date with the datetime today function. Then will use a for loop, which will convert each date string into datetimes and calculate the number of days it has been since that date, one index at a time. Indexing a pandas DataFrame can be done in many different ways, but for this loop I used: dataframe.loc[index, column name]

for i in data.index :

    data.loc[i, 'admission'] = pd.to_datetime(data.loc[i, 'admission'], format="%d %B %Y")
    data.loc[i, 'time'] = date - data.loc[i, 'admission']

I then sorted my data frame by length of membership, found in the “time” column, and alphabetical order of the country names.

data = data.sort_values(['time', 'name'])

Now my data is ready to be exported, analyzed and visualized! I decided to visualize this particular data as a highly interactive map using HighChart’s Highmaps.

One way to associate a country to its location on a map using Highmaps is to use its ISO country code. I found a set of data from one of the Highmaps examples that had the name of every country along with their associated country code.

I imported this data as a list, converted it to json using python’s dumps function and then to a pandas DataFrame using the read_json function.

# converts lists into json
country_codes = json.dumps(country_codes)
# converts json into dataframe
country_codes = pd.read_json(country_codes)

In the UN DataFrame, I had to change some of the country names, because the two sources didn’t always use the exact same names. There could have been an elegant way to do this with regex, but I simply hard coded the changes.

data.loc[data['name'] == 'East Timor', 'name'] = 'Timor-Leste'
data.loc[data['name'] == 'Macedonia', 'name'] = 'Macedonia, FYR'
data.loc[data['name'] == 'Slovakia', 'name'] = 'Slovak Republic'
# etc...

After that I was able to join the two DataFrames, giving me one with the time data as well as the country codes.

joined = pd.merge(data, country_codes, how='inner', on='name')

All I need for the visualisation is country code and times, so I selected those columns and then changed the column names. I changed their names to “hc-key” and “value”.

new_data = pd.concat([joined['code'].str.lower(), joined['time']], axis=1)
new_data.columns = ['hc-key', 'value']

Using the datetime days function, I was able to turn each datetime object into an integer number of days.

for i in new_data.index:   
    new_data.loc[i, 'value'] = new_data.loc[i, 'value'].days

I exported the DataFrame as JSON into my python directory to save it.

export = new_data.to_json('UN_data.json', orient='records')

From there it was ready to bring into a Highmaps visual. I copied and pasted the saved JSON, but it could have just as easily been imported from a URL.

I gave my map navigation buttons bottom left and a color center bottom center of the chart.

It also has a hover state, so that the countries change color when hovered over, as well as a tool tip that explicitly states the number of days. You can see the full code in the JavaScript tab below and see how easily it comes together.

Of course, this just displays the number of days until the day the data is extracted. ParseHub allows you to schedule runs every day so that your data stays up to date. The data can also be collected using ParseHub’s API options using HTTP GETs to make integration onto your website completely automated.

This example is just one of the many ways that plotting over time can be made possible with datetimes, but every project is a little different – let me know which date and time data you visualize!