ARTICLES
Sankey Plots Using googleVis
Photo: American political parties over time, taken at the National Museum of American History, August 2017
This post was originally written in January 2017, and slightly updated January 2018
Motivation
To visualize the progression of patients over the course of a longitudinal study, I wanted to do a Sankey plot. After trying a few packages, I ended up using the googleVis package. This was my first experience with Google Charts, which works quite differently than most R graphics tools. I’ll use a couple of example data sets to demonstrate the process, along with pros and cons.
What’s Google Charts/googleVis?
Google Charts is a suite of interactive data visualization tools put together, naturally, by Google. They run on JavaScript; as of the current writing I have essentially zero experience with JavaScript. So trust me, if I can use them, so can you!
Because Google Charts are interactive, they need to be opened in a web browser with internet access available.
googleVis is an R package developed to allow R users to work directly with Google Charts. As with Google Charts itself, the documentation is detailed and there are several vignettes, which is incredibly helpful. Many, many thanks to the package authors, Markus Gesmann and Diego de Castillo.
Example 1: US Immigration Data, 2015
We’ll use some US immigration data from data.world on the countries and general admission classes of those who received United States permanent residencies (green cards) in 2015. I downloaded the data here; more information on the data set can be found here.
Data Import & Management
We’ll look at only regions (not individual countries) and combine the two types of family-based immigrant visas.
## Set knitr global chunk options
knitr::opts_chunk$set(message = FALSE)
library(readxl)
library(tidyverse)
## Looking at original spreadsheet, sheet Table10 contains the data we want
immigration <- read_excel("dhs.xlsx", sheet = "Table10")
## Rename columns - we don't want spaces or special characters
names(immigration) <- c(
'Region', 'Total', 'Family_Sponsored', 'Employment',
'Immediate_Relatives', 'Diversity', 'Refugees', 'Other'
)
## Vector of regions we want to look at
regions <- c(
'Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America',
'Unknown'
)
## Function to strip commas out of values and change to numeric -
## we'll use this in our pipeline
comma_num <- function(x){
as.numeric(gsub(',', '', x))
}
## Data management pipeline
immigration <- immigration %>%
## Keep only the regions we want (vs individual countries)
filter(Region %in% regions) %>%
## Take out the duplicate row for Unknown
unique() %>%
## Make all columns numeric that should be numeric
mutate(Family_Sponsored = comma_num(Family_Sponsored),
Employment = comma_num(Employment),
Immediate_Relatives = comma_num(Immediate_Relatives),
Diversity = comma_num(Diversity),
Refugees = comma_num(Refugees),
Other = comma_num(Other),
Family_Preference = Family_Sponsored + Immediate_Relatives) %>%
## Take out Total and original family preference classes
select(-Total, -Family_Sponsored, -Immediate_Relatives)
## Check out the resulting data set
immigration
## # A tibble: 7 x 6
## Region Employment Diversity Refugees Other Family_Preference
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 4813 19659 27104 480 49359
## 2 Asia 89533 14562 67603 8842 238757
## 3 Europe 20945 11425 3317 332 49784
## 4 North America 18238 667 51535 16888 278798
## 5 Oceania 1186 721 37.0 19.0 3441
## 6 South America 9296 878 2346 1473 58316
## 7 Unknown 36.0 22.0 53.0 43.0 523
Formatting Data
We want the source (lefthand) nodes to be the region, and the target (righthand) nodes to be the immigration class. Source nodes must be in the first column, target nodes must be in the second column, and weights (in our case, counts) must be in the third column.
sankey_immigration <- immigration %>%
gather(key = Admission_Class, value = Residents, -Region)
sankey_immigration
## # A tibble: 35 x 3
## Region Admission_Class Residents
## <chr> <chr> <dbl>
## 1 Africa Employment 4813
## 2 Asia Employment 89533
## 3 Europe Employment 20945
## 4 North America Employment 18238
## 5 Oceania Employment 1186
## 6 South America Employment 9296
## 7 Unknown Employment 36.0
## 8 Africa Diversity 19659
## 9 Asia Diversity 14562
## 10 Europe Diversity 11425
## # ... with 25 more rows
Create Immigration Plot
Now let’s make the plot, using all the default settings. gvis_immigration
is the object created by googleVis::gvisSankey()
. If we plot(gvis_immigration)
, the plot will open in a new browser tab; here, we’ll print(gvis_immigration)
so the plot will be shown directly in our HTML document. We’ll also set tag = 'chart'
so that only code for the chart itself will be extracted, versus enough HTML to produce a whole separate web page. (If you forget to do this, the multiple sets of <html>
tags can make for some wacky results. Ask me how I know.)
library(googleVis)
## Create gvis object
gvis_immigration <- gvisSankey(sankey_immigration,
from = 'Region',
to = 'Admission_Class',
weight = 'Residents')
## Print gvis object - chart only, not full HTML
print(gvis_immigration, tag = 'chart')
We can quickly see that most green card recipients in 2015 came from either North America or Asia, and the majority of them had one of the two kinds of family preference visas. There were about as many new permanent residents in 2015 who entered the US as refugees/asylees as entered with employment visas.
The defaults look pretty good, but what if I have some specific requirements?
Example 2: Longitudinal Study Progression
We’ll focus here on customizing the result when describing patient progression in a study I was involved in (raw data here, page 17). I put the raw data in a Google Sheet and used the googlesheets
package by Jenny Bryan and Joanna Zhao to read in the data.
library(googlesheets)
## Register title of spreadsheet containing data sets
brain_sheets <- gs_title('BRAIN-ICU Patient Progression')
## Read in original raw data; verbose = FALSE suppresses download messages
brain_original <- brain_sheets %>%
gs_read("Original", verbose = FALSE)
## Create gvis object
gvis_brain_defaults <- gvisSankey(brain_original,
from = 'sourceNode',
to = 'targetNode',
weight = 'Patients')
## Print gvis object - chart only, not full HTML
print(gvis_brain_defaults, tag = 'chart')
This is somewhat helpful, but not as helpful as it could be. Can we change the vertical order of the nodes? Could the nodes of the same type - all the deaths, for example - be the same color all the way through the study so they’re easier to track? Could the tooltips be more informative? The answer is, of course, yes!
Reordering the Plot
Let’s say we want the plot to keep the patients with the most involvement - eg, those alive and evaluated at each time point - up at the top, and patients with the least involvement down at the bottom. To do that with googleVis, we need our data frame to be in that order. Here, I just did it manually; it’s stored in the second worksheet of my Google Sheet.
Note on googleVis options
Options are specified using a named list. Some of the pieces are googleVis-generic (height and width, for example, and whether the tooltip includes HTML); others are Sankey-plot-specific and are specified in a string named
sankey
. Note that this needs to be in Javascript format; the Google Charts/googleVis documentation are quite helpful here.
In order for our ordering to work, we need to set the sankey
-specific option for iterations
to 0. This keeps googleVis from using its default algorithm to figure out its own “best” placement for each node. We’ll also use options
to make the plot wider.
## Import reordered data from Google Sheet
brain_reordered <- brain_sheets %>%
gs_read("Reordered", verbose = FALSE)
## Create gvis object
gvis_brain_ordered <- gvisSankey(brain_reordered,
from = 'sourceNode',
to = 'targetNode',
weight = 'Patients',
options = list(height = 400, width = 700,
sankey = "{iterations: 0
}"))
## Print gvis object - chart only, not full HTML
print(gvis_brain_ordered, tag = 'chart')
Informative Tooltips
By default, the tooltips for each edge show the source and target nodes along with the weight (here, the number of patients). But wouldn’t it be good to know, for example, what proportion of patients each weight represents out of the source and target nodes? To do that, we just need to include the text we want as a column in our data frame, with a name that ends in .tooltip
. We can include HTML formatting in these tooltips; we’ll need to specify this in our options
list (see below). For a longer intro, see the Roles googleVis vignette.
## Get total number of patients at each source, target node at each time point
source_ns <- brain_reordered %>%
group_by(sourceNode) %>%
summarise(sourceN = sum(Patients)) %>%
ungroup()
target_ns <- brain_reordered %>%
group_by(targetNode) %>%
summarise(targetN = sum(Patients)) %>%
ungroup()
## Add source/target Ns and create tooltip text specific to source/target nodes
brain_reordered <- brain_reordered %>%
left_join(source_ns, by = 'sourceNode') %>%
left_join(target_ns, by = 'targetNode') %>%
mutate(
to_from = paste0('<b>', sourceNode, ' -> ', targetNode, ':</b>'),
brain.tooltip = ## Target nodes where 100% of patients came from same source:
## N + % (n) of source
ifelse(targetNode %in% c('Eligible for followup',
'Withdrew in hospital',
'Died in hospital',
'Evaluated, 3m',
'Lost to followup, 3m'),
paste0(to_from, '<br>N = ', Patients, '<br>',
round((Patients / sourceN) * 100), '% of ',
sourceN, ' ', tolower(sourceNode)),
## Source nodes where 100% of patients go to same target:
## N + % (n) of target
ifelse(sourceNode %in% c('Withdrew in hospital',
'Died in hospital',
'Died, 3m',
'Withdrew, 3m'),
paste0(to_from, '<br>N = ', Patients, '<br>',
round((Patients / targetN) * 100), '% of ',
targetN, ' ', tolower(targetNode)),
## Otherwise, add N, % (n) of source, and % (n) of target
paste0(to_from, '<br>N = ', Patients,
'<br>', round((Patients / sourceN) * 100),
'% of ', sourceN, ' ', tolower(sourceNode),
'<br>', round((Patients / targetN) * 100),
'% of ', targetN, ' ', tolower(targetNode))))
)
Beautiful Colors
You know what would make this plot even better? Color! Specifically, what if all the nodes and edges for patients who died were the same color, and all the nodes for withdrawals were the same color, and…?
Unlike other Sankey plot packages, unfortunately, you can’t quickly specify that “nodes named this should be blue” - here, it’s a manual process, and it’s not always obvious what order the colors should be specified in to match each node. My technique: Include a manual color specification of the same color for every node, then one by one turn each one bright yellow or something obvious, figure out what color it should be, change that one, and iterate till you’re done. Not ideal, but doable.
We specify colors using the node
piece of the sankey
element of options
. We’ll also use these options to tweak the labels and use color gradients for the edges rather than the default gray.
Final Progression Plot
## #B71C1C = died
## #FF6F00 = withdrew
## #FFECB3 = lost to followup
## #1B5E20 = eligible/evaluated
## #1A237E = enrolled
## Create gvis object
gvis_brain_final <-
gvisSankey(
brain_reordered,
from = 'sourceNode',
to = 'targetNode',
weight = 'Patients',
options = list(height = 400, width = 705,
tooltip = "{isHtml:'True'}",
sankey = "{link: { colorMode: 'gradient' },
node: { colors: ['#1A237E',
'#1B5E20',
'#FF6F00',
'#B71C1C',
'#1B5E20',
'#FFECB3',
'#FF6F00',
'#B71C1C',
'#1B5E20',
'#FFECB3',
'#FF6F00',
'#B71C1C'],
label: { fontSize: 14, bold: true }
},
iterations: 0
}")
)
## Print gvis object - chart only, not full HTML
print(gvis_brain_final, tag = 'chart')
Pretty, right?
Pros, Cons and Other Options
Pros
- Even the defaults look good; the finished product looks really polished without too much effort.
- Interactivity can be quite helpful depending on your context.
- Plots are largely customizable - it’s easy to generate informative tooltips, and easy-to-doable to do other custom configurations.
- HTML format is straightforward to include in RMarkdown documents/notebooks. googleVis objects can also be included in Shiny apps.
Cons
- Plots can’t be opened outside a web browser. No saving to a PDF for manuscript submission.
- Unlike other Sankey plot packages, we can’t specify X/Y placement. (For example, in the longitudinal study above, it might have been nice to have the horizontal distances better represent the time between followup points in our study; to my knowledge, that isn’t possible.)
- Other packages also have easier ways to specify colors and node labels.
Other Packages to Check Out
I looked at both the riverplot (author blog post) and sankey (Github) packages during this process. This is also a nice post on using igraph and rcharts to create a Sankey plot. Depending on what you’re after, one of these might work for you.
Session Details
My R setup at the time this post was written is detailed below.
devtools::session_info()
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-02-11
##
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 cran (@1.1.2)
## base * 3.4.3 2017-12-07 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.0)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.0)
## blogdown 0.4 2017-12-12 CRAN (R 3.4.3)
## bookdown 0.5 2017-08-20 CRAN (R 3.4.1)
## broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## codetools 0.2-15 2016-10-05 CRAN (R 3.4.3)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## crayon 1.3.4 2017-09-16 CRAN (R 3.4.1)
## curl 3.1 2017-12-12 cran (@3.1)
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.4 2017-11-09 CRAN (R 3.4.2)
## digest 0.6.14 2018-01-14 cran (@0.6.14)
## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.0)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.3)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## glue 1.2.0 2017-10-29 CRAN (R 3.4.2)
## googlesheets * 0.2.2 2017-05-07 CRAN (R 3.4.0)
## googleVis * 0.6.2 2017-01-01 CRAN (R 3.4.0)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.1)
## hms 0.3 2016-11-22 CRAN (R 3.4.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.18 2017-12-27 cran (@1.18)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.1 2017-11-03 CRAN (R 3.4.2)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.1)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.3)
## openssl 0.9.9 2017-11-10 CRAN (R 3.4.2)
## parallel 3.4.3 2017-12-07 local
## pillar 1.1.0 2018-01-14 cran (@1.1.0)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.2)
## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl * 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.3 2017-12-11 cran (@1.4.3)
## rlang 0.1.6.9003 2018-01-31 Github (tidyverse/rlang@c6747f9)
## rmarkdown 1.8 2017-11-17 cran (@1.8)
## rprojroot 1.3-2 2018-01-03 cran (@1.3-2)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0.9000 2018-01-11 Github (hadley/scales@d767915)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tibble * 1.4.2 2018-01-22 cran (@1.4.2)
## tidyr * 0.7.2 2017-10-16 CRAN (R 3.4.2)
## tidyselect 0.2.3 2017-11-06 CRAN (R 3.4.2)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.3 2017-12-07 local
## utf8 1.1.3 2018-01-03 CRAN (R 3.4.2)
## utils * 3.4.3 2017-12-07 local
## withr 2.1.1.9000 2018-01-11 Github (jimhester/withr@df18523)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.3)
## yaml 2.1.16 2017-12-12 cran (@2.1.16)