Visualizing the gender of US senators with R and Highmaps

I like interactive visualizations. They give me, as a number cruncher, unique ways to analyze and play with the data. For the reader, they also make data more engaging and easier to grasp than words or numbers alone.

I have tried several charting libraries and Highcharts is the one I like the most. Why? It is elegant, very well documented, and above all, provides an extensive set of examples and demos that help me quickly figure out how to implement my ideas.

My tool of choice for statistical analysis is R. To my great joy, R and Highcharts play really nicely together with the help of Highcharter, an open-source R wrapper for the Highcharts JavaScript library and its modules. Highcharter allows R programmers like me to create interactive, web-ready charts really easily.

One feature I love in particular is the ability to generate interactive maps using Highcharter’s built-in support for Highmaps (Highcharts’ sister product). In this article, I’ll give you a short example of how to use the map feature to bring geo-data to life.

SOURCING AND PARSING THE DATA

The data for this example is the number of female United States senators in the 114th Congress. I sourced the data from www.senate.gov as an XML file.

As you all know, the quality of input data determines output quality. Sometimes the most laborious task is preparing data for analysis. In this case, the task was pretty straightforward: after extracting the data, I cleaned up the first names a little and then used the Genderize.io API to append gender data to each record. Each gender assignment includes a probability measure, which I used to single out records that could benefit from manual review and correction. (For example, names like Pat and Rand are gender neutral, so neither I nor Genderize has complete confidence in the result.)
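
The cleaning and review step can be tried without calling the API. The following is a minimal sketch in base R, assuming a small made-up table in place of the real Genderize.io output: the `senators` data frame and its probability values are illustrative only, while the 0.75 cut-off is the one used in the full script.

```r
# Made-up records standing in for senator rows joined to Genderize.io output.
# The probabilities are illustrative, not real API results.
senators <- data.frame(
  first_name  = c("Pat", "Rand", "Mary A.", "John"),
  gender      = c("female", "male", "female", "male"),
  probability = c(0.55, 0.62, 1.00, 0.99),
  stringsAsFactors = FALSE
)

# Keep only the first word of the first name, lower-case it, drop punctuation
senators$name <- tolower(sub(" .*$", "", trimws(senators$first_name)))
senators$name <- gsub("[[:punct:]]", "", senators$name)

# Single out low-confidence guesses for manual review
threshold <- 0.75
needs_review <- senators[senators$probability < threshold,
                         c("name", "gender", "probability")]
print(needs_review)

# Apply the manual correction to the reviewed names
senators$gender <- ifelse(senators$name %in% c("rand", "pat"),
                          "male", senators$gender)
```

Only the rows flagged in `needs_review` have to be checked by hand, which keeps the manual work small even for longer lists of names.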

Next, I needed to tabulate the number of male and female senators by state. The full code, from download to finished map, is below:

#Load required libraries
library(XML)
library(genderizeR)
library(stringr)
library(dplyr)
library(highcharter)
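#Read XML file with senators info from www.senate.gov
url <- "http://www.senate.gov/general/contact_information/senators_cfm.xml"
#Keep text columns as character so the string functions below work on them
data <- xmlToDataFrame(url, stringsAsFactors = FALSE)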
#Create name variable with cleaned first names (first word, lower case, no punctuation)
data %>% 
  select(first_name, last_name, member_full, state) %>% 
  mutate(name=str_to_lower(word(first_name, 1))) %>% 
  mutate(name=str_replace_all(name,"[[:punct:]]","")) %>% 
  na.omit() -> data
#Genderize the distinct cleaned names using genderize.io API
#(cleaning first keeps the API output and the join key identical)
data %>% 
  pull(name) %>% 
  unique() %>% 
  findGivenNames() -> given_names
#Join output with original data
data %>% left_join(given_names, by="name") -> data
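#Inspect cases with low probability (as.numeric in case the API output is text)
data %>% filter(as.numeric(probability) < 0.75)
#Manually correct the gender-neutral names found in the inspection
data %>% mutate(gender = ifelse(name %in% c("rand", "pat"), "male", gender)) -> data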
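#Summarize data by state: one row per state, one column per gender
data %>% 
  group_by(state, gender) %>% 
  summarise(senators=n()) %>% 
  tidyr::spread(gender, senators) %>% 
  ungroup() -> data
#A state with no senators of a given gender is NA after spreading; set it to 0
data[is.na(data)] <- 0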
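#Load geojson with US states boundaries
data("usgeojson")
#Map colors
colfunc <- colorRampPalette(c("white", "darkviolet"))
n <- max(data$female)
#Assign one color to each output (# of female senators). Highcharts expects
#color stop positions between 0 and 1, so the counts are divided by the maximum
colstops <- data.frame(q = 0:n/n, c = colfunc(n+1), stringsAsFactors = FALSE) %>%
  list_parse2() #named list.parse2() in older highcharter versions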
#Create highmap with the previous info using highcharter package
highchart() %>%
  hc_title(text = "Current Women Senators of the 114th Congress") %>%
  hc_subtitle(text = "Source: http://www.senate.gov/") %>%
  hc_add_series_map(usgeojson, data, name = "Women Senators",
                    value = "female", joinBy = c("postalcode", "state"),
                    dataLabels = list(enabled = TRUE,
                                      format = '{point.properties.postalcode}')) %>%
  hc_colorAxis(stops = colstops) %>%
  hc_mapNavigation(enabled = TRUE)->m
#Save map to a local HTML file
library(htmlwidgets)
saveWidget(m, file="m.html")
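
The tabulation step in the script above (group by state and gender, then spread to one column per gender) can be cross-checked on toy data with base R alone. This is a sketch under stated assumptions: the `toy` data frame and its state codes "AA", "BB" and "CC" are made up, and `table()` is swapped in for the dplyr and tidyr calls.

```r
# Made-up records, one row per senator (illustrative only)
toy <- data.frame(
  state  = c("AA", "AA", "BB", "BB", "CC", "CC"),
  gender = c("female", "female", "female", "male", "male", "male"),
  stringsAsFactors = FALSE
)

# Cross-tabulate state by gender; absent combinations come out as 0, not NA
counts <- as.data.frame.matrix(table(toy$state, toy$gender))
counts$state <- rownames(counts)
rownames(counts) <- NULL
print(counts)
```

Because `table()` fills empty cells with zeros, this version needs no separate step to replace missing values, which makes it a handy sanity check for the summarised data.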

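Next, I join my freshly prepared gender data with a GeoJSON dataset of US state boundaries (the usgeojson data loaded in the code above), matching each state's postal code to the state column of my table. This gives me all the data I need to generate the map visualization and assign values to each state.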
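
How a value turns into a color deserves a closer look. Highcharts color-axis stops are pairs of a relative position between 0 and 1 and a color, so the counts have to be rescaled before they are passed on. Here is a small sketch in base R, assuming a maximum of 2 female senators per state (each state has two senators); the names `n` and `stops` are chosen for illustration.

```r
# Palette from white (no female senators) to dark violet (the maximum)
colfunc <- colorRampPalette(c("white", "darkviolet"))

# Assumed maximum count per state: each state has two senators
n <- 2

# One stop per possible count, with positions rescaled to the 0..1 range
stops <- data.frame(
  q = (0:n) / n,
  c = colfunc(n + 1),
  stringsAsFactors = FALSE
)
print(stops)
```

With positions rescaled this way, a state with no female senators sits at the white end of the gradient and a state with two sits at the dark violet end.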

The result is what you see below. You can hover over each state to see additional data.

If you are a statistician (or a data scientist, as the fashionable term is), realize that you are a storyteller. As such, data visualization is perhaps your most important vehicle for conveying information, not just a way to make data easier to analyze for your own needs. If data visualization is not a key part of your analysis workflow, it’s time to pick up some new tricks!

If you like R, you’ll love Highcharter and Highcharts. Give them a try.
