Interactive Chord Diagrams

Preamble

In [1]:
from chord import Chord

Introduction

In a chord diagram (or radial network), entities are arranged radially as segments with their relationships visualised by arcs that connect them. The size of the segments illustrates the numerical proportions, whilst the size of the arc illustrates the significance of the relationships [1].

Chord diagrams are useful when trying to convey relationships between different entities, and they can be beautiful and eye-catching.

Get Chord Pro

Click here to get lifetime access to the full-featured chord visualization API, producing beautiful interactive visualizations, e.g. those featured on the front page of Reddit.

  • Produce beautiful interactive Chord diagrams.
  • Customize colours and font-sizes.
  • Access Divided mode, enabling two sides to your diagram.
  • Use Symmetric and Asymmetric modes.
  • Add images and text on hover.
  • Access finer customisations, including HTML injection.
  • Use commercially without an open-source requirement.
  • Currently supports Python, JavaScript, and Rust, with many more to come (accepting requests).

The Chord Package

With Python in mind, there are many libraries available for creating Chord diagrams, such as Plotly, Bokeh, and a few that are lesser-known. However, I wanted to use the implementation from d3 because it can be customised to be highly interactive and to look beautiful.

I couldn't find anything that ticked all the boxes, so I made a wrapper around d3-chord myself. It took some time to get it working, but I wanted to hide away everything behind a single constructor and method call. The tricky part was enabling multiple chord diagrams on the same page, and then loading resources in a way that would support Jupyter Notebooks.

You can get the package either from PyPI using pip install chord or from the GitHub repository. With your processed data, you should be able to plot something beautiful with just a single line, Chord(data, names).show(). To enable the pro features of the chord package, get Chord Pro.

The Dataset

The focus for this section will be the demonstration of the chord package. To keep it simple, we will use synthetic data that illustrates the co-occurrences between movie genres within the same movie.

In [2]:
matrix = [
    [0, 5, 6, 4, 7, 4],
    [5, 0, 5, 4, 6, 5],
    [6, 5, 0, 4, 5, 5],
    [4, 4, 4, 0, 5, 5],
    [7, 6, 5, 5, 0, 4],
    [4, 5, 5, 5, 4, 0],
]

names = ["Action", "Adventure", "Comedy", "Drama", "Fantasy", "Thriller"]
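
The matrix above is synthetic, but a matrix of this shape could be built from raw data. As a minimal sketch, assuming a hypothetical list of movies where each movie is tagged with a list of genres (the movies list and its contents are made up for illustration), we can count how often each pair of genres appears together:

```python
from itertools import combinations

# Hypothetical movies, each tagged with one or more genres (illustrative only).
movies = [
    ["Action", "Adventure"],
    ["Action", "Comedy", "Fantasy"],
    ["Drama"],
    ["Adventure", "Fantasy"],
]

genres = ["Action", "Adventure", "Comedy", "Drama", "Fantasy", "Thriller"]
index = {genre: i for i, genre in enumerate(genres)}

# Start from an all-zero square matrix, one row and one column per genre.
co_occurrence = [[0] * len(genres) for _ in genres]

for tags in movies:
    # Every unordered pair of distinct genres within a movie counts once,
    # and is mirrored so that the matrix stays symmetric.
    for a, b in combinations(sorted(set(tags)), 2):
        co_occurrence[index[a]][index[b]] += 1
        co_occurrence[index[b]][index[a]] += 1

for row in co_occurrence:
    print(row)
```

The result is a symmetric matrix with zeros on the diagonal, which is the same kind of input we pass to Chord() throughout this section.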

Chord Diagrams

Let's see what the Chord() defaults produce when we invoke the show() method.

In [3]:
Chord(matrix, names).show()
Chord Diagram
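
To get a feel for what the segment sizes represent, we can work out each genre's share of the total by hand. This is only a sketch, and it assumes that a segment's size is proportional to the sum of its row in the matrix, which is how the underlying d3-chord layout sizes its groups.

```python
matrix = [
    [0, 5, 6, 4, 7, 4],
    [5, 0, 5, 4, 6, 5],
    [6, 5, 0, 4, 5, 5],
    [4, 4, 4, 0, 5, 5],
    [7, 6, 5, 5, 0, 4],
    [4, 5, 5, 5, 4, 0],
]
names = ["Action", "Adventure", "Comedy", "Drama", "Fantasy", "Thriller"]

# Sum each row to get the total co-occurrences involving that genre.
row_totals = [sum(row) for row in matrix]
grand_total = sum(row_totals)

# Express each row total as a fraction of the whole circle.
shares = {name: total / grand_total for name, total in zip(names, row_totals)}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```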

Different Colours

The defaults are nice, but what if we want different colours? You can pass in almost anything from d3-scale-chromatic, or you could pass in a list of hexadecimal colour codes.

In [4]:
Chord(matrix, names, colors="d3.schemeSet2").show()
Chord Diagram
In [5]:
Chord(matrix, names, colors=f"d3.schemeGnBu[{len(names)}]").show()
Chord Diagram
In [6]:
Chord(matrix, names, colors="d3.schemeSet3").show()
Chord Diagram
In [7]:
Chord(matrix, names, colors=f"d3.schemePuRd[{len(names)}]").show()
Chord Diagram
In [15]:
Chord(matrix, names, colors=f"d3.schemeYlGnBu[{len(names)}]").show()
Chord Diagram
In [9]:
hex_colours = ["#222222", "#333333", "#4c4c4c", "#666666", "#848484", "#9a9a9a"]

Chord(matrix, names, colors=hex_colours).show()
Chord Diagram
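
Typing out hexadecimal codes by hand becomes tedious as the number of entities grows. The sketch below generates a greyscale ramp with one colour per name; grey_ramp is a hypothetical helper written for this example and is not part of the chord package.

```python
def grey_ramp(n, darkest=0x22, lightest=0x9A):
    """Return n hex colour codes evenly spaced from darkest to lightest grey."""
    if n == 1:
        levels = [darkest]
    else:
        # Evenly space the grey levels between the two end points.
        step = (lightest - darkest) / (n - 1)
        levels = [round(darkest + i * step) for i in range(n)]
    # Repeat the same level for the red, green, and blue channels.
    return [f"#{level:02x}{level:02x}{level:02x}" for level in levels]


names = ["Action", "Adventure", "Comedy", "Drama", "Fantasy", "Thriller"]
hex_colours = grey_ramp(len(names))
print(hex_colours)
```

The resulting list can be passed to the colors parameter in the same way as the hand-written list above.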

Label Styling

We can disable the wrapped labels, and even change the colour.

In [10]:
Chord(matrix, names, wrap_labels=False, label_color="#4c40bf").show()
Chord Diagram

Opacity

We can also change the default opacity of the relationships.

In [11]:
Chord(matrix, names, opacity=0.1).show()
Chord Diagram

Conclusion

In this section, we've introduced the chord diagram and chord package. We used the package and some synthetic data to demonstrate several chord diagram visualisations with different configurations. The chord Python package is available for free using pip install chord.


  1. Tintarev, N., Rostami, S., & Smyth, B. (2018, April). Knowing the unknown: visualising consumption blind-spots in recommender systems. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (pp. 1396-1399). 

NDArray Index Arrays and Mask Index Arrays

Preamble

In [2]:
:dep darn = {version = "0.1.15"}
:dep ndarray = {version = "0.13.0"}
:dep itertools = {version = "0.9.0"}
:dep plotly = {version = "0.4.0"}
extern crate ndarray;

use ndarray::prelude::*;
use itertools::Itertools;
use plotly::{Plot, Scatter, Layout, Rgb, NamedColor};
use plotly::common::{Mode, Title, Marker, Line};
use plotly::layout::{Axis};

Introduction

NumPy has many features that Rust's NDArray doesn't have yet, e.g. index arrays and mask index arrays. For example, when we index a one-dimensional array we often use a single integer value to return an element at the corresponding position. That is, given an array example containing nine floating-point elements:

In [3]:
let example = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9];

We would access the third element using

In [4]:
example[2]
Out[4]:
0.3

However, there is more than one way to index an array! We may wish to index an array with another array to select multiple samples at once, e.g. example[[2, 5, 8]] for the third, sixth, and ninth sample. We may also want to index an array using a boolean mask array, allowing us to select samples based on some criteria, e.g. example[example > 0.5] to select all samples greater than $0.5$.

Currently, NDArray doesn't offer an easy way to index an array with another array or with a mask array, but it can still be achieved with some extra work. What follows is an approach for selecting samples using index and mask arrays.
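
For reference, the two NumPy-style selections described above can be written out explicitly in plain Python. This is only an illustration of the semantics we want to reproduce, not NDArray or NumPy code:

```python
example = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Index array: pick the elements at the listed positions.
index_array = [2, 5, 8]
selected = [example[i] for i in index_array]

# Mask array: one boolean per element, True where the criterion holds.
mask = [value > 0.5 for value in example]
masked = [value for value, keep in zip(example, mask) if keep]

print(selected)
print(masked)
```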

Loading our Dataset

We will continue using the Iris Flower dataset, so we will load it using the darn crate to avoid repetition.

We will only be using our species labels, stored in labels, throughout the rest of this section.

In [5]:
let iris = darn::iris_typed();

The darn::iris_typed() function returns a tuple of type (Array2::<f32>, Vec<String>, Array1::<String>), where the first element is an array containing our iris flower features, the second element a vector of our feature headers, and the final element is an array containing our iris species labels. As always, let's have a quick look at a few samples from our features.

In [6]:
darn::show_frame(&iris.0, Some(&iris.1));
Out[6]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
... ... ... ...
6.7 3.0 5.2 2.3
6.3 2.5 5.0 1.9
6.5 3.0 5.2 2.0
6.2 3.4 5.4 2.3
5.9 3.0 5.1 1.8

We'll also check the unique elements in our labels.

In [7]:
iris.2.iter().unique().format("\n")
Out[7]:
"Iris-setosa"
"Iris-versicolor"
"Iris-virginica"

To make things easier for ourselves throughout the rest of this section, let's assign the various parts of our dataset to different variables.

In [8]:
let features = iris.0;
let headers = iris.1;
let labels = iris.2;

Index Array

During our analyses, we may encounter the need to select multiple samples from our dataset. For example, we may wish to select the samples at index $0$, $10$, and $20$. Let's output the samples at these indices for reference.

In [9]:
println!("Sample 0: {:}", features.row(0));
Sample 0: [5.1, 3.5, 1.4, 0.2]
In [10]:
println!("Sample 10: {:}", features.row(10));
Sample 10: [5.4, 3.7, 1.5, 0.2]
In [11]:
println!("Sample 20: {:}", features.row(20));
Sample 20: [5.4, 3.4, 1.7, 0.2]

To return these samples all at once using an array as the index, we can use the ArrayBase::select() function:

Select arbitrary subviews corresponding to indices and copy them into a new array.

The first parameter for this function is the axis we wish to select along, and the second is an array containing the desired indices.

In [12]:
println!("{:}",features.select(Axis(0), &[0, 10, 20]));
[[5.1, 3.5, 1.4, 0.2],
 [5.4, 3.7, 1.5, 0.2],
 [5.4, 3.4, 1.7, 0.2]]

If we check these against the individually indexed samples above, we can see that it has worked as intended.

Index Mask Arrays

We may also want to use a mask to index our array. To work around this missing feature in NDArray, we can build a boolean mask and then use it to generate an index array. We can also use column-wise boolean operations when considering multiple column masks.

Building Boolean Masks

We can build a boolean mask of the same shape as our array with true values where some condition is met. For example, in NumPy we could do features > 0.5 to create a mask with true where values are over $0.5$, and false elsewhere.

We can do the same with Rust and ndarray.

In [13]:
let mask = features.map(|elem| *elem > 0.5);

Let's take a peek at our mask to see how it looks.

In [14]:
darn::show_frame(&mask, Some(&headers))
Out[14]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
true true true false
true true true false
true true true false
true true true false
true true true false
... ... ... ...
true true true true
true true true true
true true true true
true true true true
true true true true

We could also build a mask using our labels. For example, we may want to return true where elements are equal to Iris-virginica.

In [15]:
let mask = labels.map(|elem| elem == "Iris-virginica");
println!("{:}", mask);
[false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true]

Indexing with Mask Arrays

Now to build an index array and a mask array simultaneously.

In [16]:
let mut count: usize = 0;
let mut indices = Vec::<usize>::new();
let mask = labels.map(|elem| {
    // Evaluate the criterion once and reuse it for both outputs.
    let is_match = elem == "Iris-virginica";
    if is_match { indices.push(count); }
    count += 1;
    is_match
});

With this approach, we're iterating through every element in the array we wish to mask, labels. When our criterion is satisfied, i.e. elem == "Iris-virginica", we're pushing the current index stored in count to a vector named indices. The map transformation itself builds the mask based on the criterion specified.
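
The logic of this single pass is independent of the language. As a rough illustration in plain Python, using a tiny stand-in for labels (the four values below are made up), the same idea looks like this:

```python
# A tiny made-up stand-in for the labels array.
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-virginica"]

mask = []
indices = []

for position, label in enumerate(labels):
    is_match = label == "Iris-virginica"
    # The mask records the outcome for every element...
    mask.append(is_match)
    # ...whilst the index list only records positions that matched.
    if is_match:
        indices.append(position)

print(mask)
print(indices)
```

The index list could also be derived from the mask afterwards; building both in one pass simply avoids a second iteration.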

Let's have a look at the indices returned from this approach.

In [17]:
indices
Out[17]:
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]

Finally, we can use these indices to select all the samples which belong to the Virginica species.

In [18]:
let virginica = features.select(Axis(0), &indices);
darn::show_frame(&virginica, Some(&headers));
Out[18]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
6.3 3.3 6.0 2.5
5.8 2.7 5.1 1.9
7.1 3.0 5.9 2.1
6.3 2.9 5.6 1.8
6.5 3.0 5.8 2.2
... ... ... ...
6.7 3.0 5.2 2.3
6.3 2.5 5.0 1.9
6.5 3.0 5.2 2.0
6.2 3.4 5.4 2.3
5.9 3.0 5.1 1.8

Plotting with Plotly

It's always helpful to visualise what we've achieved. Let's plot the petal length and width for all of our samples, and then plot the same for all samples of the Virginica species in a different colour.

In [19]:
let layout = Layout::new()
    .xaxis(Axis::new().title(Title::new("Length (cm)")))
    .yaxis(Axis::new().title(Title::new("Width (cm)")));

let petal = Scatter::new(features.column(2).to_vec(), features.column(3).to_vec())
    .mode(Mode::Markers)
    .name("Petal (All)")
    .marker(Marker::new().color(Rgb::new(69, 57, 172)).size(12));
    
let petal_v = Scatter::new(virginica.column(2).to_vec(), virginica.column(3).to_vec())
    .mode(Mode::Markers)
    .name("Petal (Virginica)")
    .marker(Marker::new().color(Rgb::new(234, 105, 0)).size(12))
    .line(Line::new().color(NamedColor::White).width(0.5));

let mut plot = Plot::new();

plot.set_layout(layout);
plot.add_trace(petal);
plot.add_trace(petal_v);

darn::show_plot(plot);
Out[19]:

Conclusion

In this section, we had a look at how to build index and mask index arrays for use with NDArray indexing. We demonstrated this with the Iris Flower dataset, which enabled us to produce a nice visualisation showing the clustering of one of the species for two of the features. In the following sections, we'll have a look at how to use index and mask index arrays to modify values in-place.

Coronavirus Time Series Line and Bar Chart

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import plotly
import plotly.graph_objects as go    # for data visualisation
from plotly.subplots import make_subplots

# Optional Customisations
import plotly.io as pio              # to set shahin plot layout
pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(
    legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),
    margin=dict(t=0,r=0,b=0,l=0))).layout.template
pio.templates.default = 'shahin'
pio.renderers.default = "notebook_connected" # remove when running locally 

Introduction

In this section, we're going to use daily confirmed cases data for COVID-19 in the UK made available at coronavirus.data.gov.uk to create a time series plot. Our goal will be to visualise the number of new cases and cumulative cases over time.

Terms of use taken from the data source

No special restrictions or limitations on using the item’s content have been provided.

Visualising the Table

The first step is to read the CSV data into a pandas.DataFrame and display the first five samples.

In [2]:
data = pd.read_csv('https://shahinrostami.com/datasets/coronavirus-cases_latest.csv')
data.head()
Out[2]:
areaType areaName areaCode date newCasesByPublishDate cumCasesByPublishDate
0 nation England E92000001 2020-07-22 519.0 255038
1 nation Northern Ireland N92000002 2020-07-22 9.0 5868
2 nation Scotland S92000003 2020-07-22 10.0 18484
3 nation Wales W92000004 2020-07-22 22.0 16987
4 nation England E92000001 2020-07-21 399.0 254519

Let's filter this data to only include rows where areaName is England.

In [4]:
data = data[data['areaName']=='England']
data.head()
Out[4]:
areaType areaName areaCode date newCasesByPublishDate cumCasesByPublishDate
0 nation England E92000001 2020-07-22 519.0 255038
4 nation England E92000001 2020-07-21 399.0 254519
8 nation England E92000001 2020-07-20 535.0 254120
12 nation England E92000001 2020-07-19 672.0 253585
16 nation England E92000001 2020-07-18 796.0 252913
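
As a quick sanity check on the table above, the cumulative column should be the running total of the new cases column. The sketch below verifies this for the five England rows shown, using values copied from the output and ordered oldest first:

```python
from itertools import accumulate

# Values copied from the five England rows above, oldest date first.
new_cases = [796, 672, 535, 399, 519]
cumulative = [252913, 253585, 254120, 254519, 255038]

# The total before the first of these days is the first cumulative
# value minus that day's new cases.
baseline = cumulative[0] - new_cases[0]

# Add the running sum of new cases on top of the baseline.
running_total = [baseline + total for total in accumulate(new_cases)]

print(running_total == cumulative)
```

Note the difference in scale between the two lists: the daily values are in the hundreds whilst the totals are around a quarter of a million, so plotting both against a single y-axis would flatten the daily values.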

This data looks ready to plot. We have our dates in a column named date, the new daily cases in a column named newCasesByPublishDate, and the daily cumulative cases in a column named cumCasesByPublishDate. For this plot, we'll enable a secondary y-axis so that we can present our cumulative cases as a line, and our new cases with bars.

In [6]:
from plotly.subplots import make_subplots

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(x=data['date'], y=data['cumCasesByPublishDate'],
                         mode='lines+markers',
                         name='Total Cases',
                         line_color='crimson'),
                         secondary_y=True)

fig.add_trace(go.Bar(x=data['date'], y=data['newCasesByPublishDate'],
                     name='New Cases',
                     marker_color='darkslategray'),
                     secondary_y=False)
fig.show()

It's an interactive plot, so you can hover over it to get more information.

Conclusion

In this section, we went on a rather quick journey. This involved loading in the CSV data directly from a web resource, and then plotting lines and bars on the same plot.

Unique Array Elements and their Frequency

Preamble

In [2]:
:dep ndarray-csv = {version = "0.4.1"}
:dep ndarray = {version = "0.13.0"}
:dep darn = {version = "0.1.15"}
:dep ureq = {version = "0.11.4"}
:dep itertools = {version = "0.9.0"}
:dep plotly = {version = "0.4.0"}
extern crate csv;
extern crate itertools;

use std::io::prelude::*;
use std::fs::*;
use ndarray::prelude::*;
use ndarray_csv::Array2Reader;
use std::str::FromStr;
use itertools::Itertools;
use plotly::{Plot, Bar, Layout};
use plotly::common::{Mode, Title};
use plotly::layout::{Axis};
use std::collections::HashMap;

Introduction

In this section, we're going to take a look at some approaches to determining the unique elements in a column of data and their frequency. This is a common task which we may often apply to categorical data, and it can easily give us a good idea of the balance of categorical values across our dataset.

If you're familiar with Python and NumPy, you may have encountered the numpy.unique() function, which returns the sorted unique elements of an array. In more recent versions of NumPy, the numpy.unique() function also takes a parameter named return_counts, which when set to True will also return the number of times each unique element appears in the array.
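
As a point of comparison, the behaviour described above (sorted unique elements plus their counts) can be sketched with Python's standard library alone. The labels list below is a small made-up sample rather than the full dataset:

```python
from collections import Counter

# A small made-up sample of species labels.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Count how many times each label appears.
counts = Counter(labels)

# Sort the unique labels, then look up the count for each in that order.
unique_elements = sorted(counts)
unique_frequency = [counts[element] for element in unique_elements]

print(unique_elements)
print(unique_frequency)
```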

Let's see how we can do the same for our ndarray::Array2.

Loading our Dataset

We will continue using the Iris Flower dataset, so we need to load it into our raw string array first.

In [3]:
let file_name = "Iris.csv";

let res = ureq::get("https://shahinrostami.com/datasets/Iris.csv").call().into_string()?;

let mut file = File::create(file_name)?;
file.write_all(res.as_bytes())?;
let mut rdr = csv::Reader::from_path(file_name)?;
remove_file(file_name)?;

let data: Array2<String>= rdr.deserialize_array2_dynamic().unwrap();
let mut headers : Vec<String> = Vec::new();

for element in rdr.headers()?.into_iter() {
        headers.push(String::from(element));
};

Moving Data to Typed Arrays

We need to convert from String to the desired type, and move our data over to the typed arrays.

In [4]:
let mut features: Array2::<f32> =  Array2::<f32>::zeros((data.shape()[0],0));

for &f in [1, 2, 3, 4].iter() {
    features = ndarray::stack![Axis(1), features,
        data.column(f as usize)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];
};

let feature_headers = headers[1..5].to_vec();
let labels: Array1::<String> = data.column(5).to_owned();

We will only be using our species labels, stored in labels, throughout the rest of this section.

Unique Elements

We can get the unique elements in our array using the unique() function provided by the itertools crate.

Return an iterator adaptor that filters out elements that have already been produced once during the iteration. Duplicates are detected using hash and equality.

We can then iterate through this and output the elements to the output cell.

In [5]:
for i in labels.iter().unique() {
    println!("{}",i);
};
Iris-setosa
Iris-versicolor
Iris-virginica

We can also use the .format() function on .unique() to print out our unique elements.

In [6]:
labels.iter().unique().format("\n")
Out[6]:
"Iris-setosa"
"Iris-versicolor"
"Iris-virginica"

However, we may want to store these in a vector to use later too. Let's store them in unique_elements.

In [21]:
let unique_elements = labels.iter().cloned().unique().collect_vec();

unique_elements
Out[21]:
["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

Count of Unique Elements

In this case it's easy to see we have three unique elements, but if the list is too long or we wanted to store the value for later use, we can use the len() function on our vector containing the unique elements.

In [22]:
unique_elements.len()
Out[22]:
3

We could also get the count from our iterator.

In [23]:
labels.iter().unique().count()
Out[23]:
3

Frequency of Unique Elements

To find the frequency of a specific string we can use filter() and then count().

In [24]:
labels.iter().filter(|&elem| *elem == "Iris-virginica").count()
Out[24]:
50

We may also want to find the frequency of every unique element and store it in a vector. We'll call this unique_frequency.

In [25]:
let mut unique_frequency = Vec::<usize>::new();

We can populate this by iterating through the unique elements and using each string to find its frequency with the filter() approach above.

In [26]:
for unique_elem in unique_elements.iter() {
    unique_frequency.push(labels.iter().filter(|&elem| elem == unique_elem).count());
};

unique_frequency
Out[26]:
[50, 50, 50]

Visualise the Frequency of Unique Elements

It's useful to visualise the frequency of elements in our array, especially if we're interested in looking at the balance in our dataset. We can do this with a Bar plot using Plotly.

In [27]:
let layout = Layout::new()
    .yaxis(Axis::new().title(Title::new("Frequency")));

let freq_bars = Bar::new(unique_elements, unique_frequency);

let mut plot = Plot::new();

plot.set_layout(layout);
plot.add_trace(freq_bars);

darn::show_plot(plot)
Out[27]:

Unique Elements and their Frequency with HashMaps

We can do something similar with a HashMap. First we'll define a variable of type HashMap<String, i32>, where the String part will be the unique element, and the i32 will be the frequency.

In [14]:
let mut value_counts : HashMap<String, i32> = HashMap::new();

We can then populate this by iterating through our original labels array.

In [15]:
for item in labels.iter() {
    *value_counts.entry(String::from(item)).or_insert(0) += 1;
};

Printing out the results, we can see that value_counts now contains all the information that we're after.

In [16]:
println!("{:#?}", value_counts);
{
    "Iris-setosa": 50,
    "Iris-virginica": 50,
    "Iris-versicolor": 50,
}

We can also get these values directly by key.

In [17]:
value_counts.get("Iris-versicolor").unwrap()
Out[17]:
50

However, you will need to map this HashMap to separate vectors if you want to use it for plotting with Plotly.
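
The mapping step itself is small. The sketch below shows the idea in Python, with a dictionary standing in for the HashMap; sorting by key is an extra choice made here to give the bars a stable order, since hash maps do not guarantee one:

```python
# A dictionary standing in for the value_counts HashMap.
value_counts = {"Iris-setosa": 50, "Iris-virginica": 50, "Iris-versicolor": 50}

# Sort the (key, value) pairs by key so the order is reproducible.
pairs = sorted(value_counts.items())

# Split the pairs into two parallel lists, one per plot axis.
categories = [key for key, _ in pairs]
frequencies = [value for _, value in pairs]

print(categories)
print(frequencies)
```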

Conclusion

In this section, we've demonstrated a few approaches to identifying the unique elements in an array, counting the number of unique elements, and the frequency of these unique elements. We also visualised the frequency of our unique elements which is useful during exploratory data analysis.

Coronavirus Time Series Map Animation

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.express as px

# Optional Customisations
import plotly.io as pio              # to set shahin plot layout
pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(
    legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),
    margin=dict(t=0,r=0,b=0,l=0))).layout.template
pio.templates.default = 'shahin'
pio.renderers.default = "notebook_connected" # remove when running locally 

Introduction

In this section, we're going to visualise Novel Coronavirus 2019 time series data for confirmed cases, recovered cases and deaths. We'll be working on two visualisations:

  1. A static map visualising our features for the latest time group in the dataset.
  2. An interactive and animated map visualising our features over time.

We'll be using the Mapbox service so we'll need to set our access token.

In [2]:
access_token = 'pk.eyJ1Ijoic2hhaGlucm9zdGFtaSIsImEiOiJjazdudHRramQwMmM2M2xvZ2Q3Z3I4NW5wIn0.ZcEeYDKg4_JTvP5xhPeApw'
px.set_mapbox_access_token(access_token)

Note

To plot on Mapbox maps with Plotly you will need a Mapbox account and a public Mapbox Access Token. Copy yours over the string assigned to access_token in the cell above.

The Dataset

We're going to be using the Novel Corona virus - COVID19 dataset with the following description:

The new strain of Coronavirus has had a worldwide effect. It has affected people from different countries. The dataset provides, a time series data tracking the number of people affected by the virus, how many deaths has the virus caused and the number of reported people who have recovered.

Let's download it from their repository and take a peek.

In [3]:
data_url = 'https://shahinrostami.com/datasets/time-series-19-covid-combined.csv'
data = pd.read_csv(data_url)
data.head()
Out[3]:
Date Country/Region Province/State Lat Long Confirmed Recovered Deaths
0 2020-01-22 Afghanistan NaN 33.0 65.0 0 0.0 0
1 2020-01-23 Afghanistan NaN 33.0 65.0 0 0.0 0
2 2020-01-24 Afghanistan NaN 33.0 65.0 0 0.0 0
3 2020-01-25 Afghanistan NaN 33.0 65.0 0 0.0 0
4 2020-01-26 Afghanistan NaN 33.0 65.0 0 0.0 0

We can see that we have the following features to work with.

In [4]:
print(data.columns.values)
['Date' 'Country/Region' 'Province/State' 'Lat' 'Long' 'Confirmed'
 'Recovered' 'Deaths']
  • Province/State and Country/Region. These features contain the named location information associated with the sample. We'll use the Province/State field in our on-hover tooltip and to group our markers from one animation frame to the next. We can see from the first five samples that some of these may be empty, or NaN, so as a workaround we'll copy in the "Country/Region" feature where the data is missing.
In [5]:
missing_states = pd.isnull(data['Province/State'])
data.loc[missing_states,'Province/State'] = data.loc[missing_states,'Country/Region']
  • Lat and Long. These features contain the latitude and longitude geographic coordinates associated with the sample. We'll use both of these to determine where on the map we will draw our markers.
  • Date. This feature contains the date associated with the sample. We'll use this to build our animation over time.
  • Confirmed, Recovered, Deaths. These features contain numerical values for the number of confirmed cases, the number of recovered cases, and the number of deaths, respectively. We'll use the number of confirmed cases to change the size of our markers, and the number of deaths to change the colour.
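
The fallback applied to Province/State above can be pictured row by row. This sketch uses plain dictionaries instead of a DataFrame, and the second row is a hypothetical example of a sample that already has a province:

```python
rows = [
    {"Country/Region": "Afghanistan", "Province/State": None},
    # Hypothetical row that already names a province.
    {"Country/Region": "Australia", "Province/State": "Queensland"},
]

for row in rows:
    # Only fill in the country where the province is missing.
    if row["Province/State"] is None:
        row["Province/State"] = row["Country/Region"]

print([row["Province/State"] for row in rows])
```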

We'll also add our own feature to estimate the number of active cases. We'll calculate it by subtracting the number of recovered cases and deaths from the confirmed cases. We can use this instead of the confirmed cases for our marker size in the animation.

In [6]:
data['Active'] = data['Confirmed'] - data['Recovered'] - data['Deaths']

There's a possibility we will have NaN values in our data. We're not interested in investigating this further or conducting any imputation in this section, so we will simply remove any rows that have this issue.

In [7]:
data = data.dropna()
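
To see why the new Active feature can introduce NaN values, consider the sketch below. It uses two made-up rows and plain Python floats: any arithmetic involving NaN yields NaN, so a missing Recovered value propagates into Active and the whole row is then filtered out.

```python
import math

# Two made-up rows; the second is missing its Recovered value.
rows = [
    {"Confirmed": 100, "Recovered": 30.0, "Deaths": 5},
    {"Confirmed": 40, "Recovered": math.nan, "Deaths": 1},
]

for row in rows:
    row["Active"] = row["Confirmed"] - row["Recovered"] - row["Deaths"]


def has_nan(row):
    """Return True if any value in the row is a float NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in row.values())


# Keep only the rows without any NaN values.
complete = [row for row in rows if not has_nan(row)]

print(len(rows), len(complete))
```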

The Latest Information

Let's create the first of our two visualisations. This one will present our features for the most recent time point in our data. We need to create a Boolean mask so that we can select only the relevant samples.

In [8]:
date_mask = data['Date'] == data['Date'].max()
date_mask
Out[8]:
0        False
1        False
2        False
3        False
4        False
         ...  
15161    False
15162    False
15163    False
15164    False
15165     True
Name: Date, Length: 14074, dtype: bool
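
Note that the Date column holds strings rather than date objects (an assumption based on read_csv being called without any date parsing). Taking the maximum still finds the latest date because ISO 8601 dates of the form YYYY-MM-DD sort chronologically when compared as text. A small sketch with made-up dates:

```python
from datetime import date

# Made-up ISO 8601 date strings, deliberately out of order.
dates = ["2020-01-22", "2020-03-09", "2020-02-15", "2020-03-09"]

# The maximum by text comparison...
latest = max(dates)

# ...matches the maximum when the strings are parsed as real dates.
latest_parsed = max(dates, key=date.fromisoformat)

# One boolean per entry, True where the entry is the latest date.
date_mask = [d == latest for d in dates]

print(latest, date_mask)
```

This shortcut relies on the zero-padded year-month-day ordering; other date formats would need parsing first.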

We can now use this mask to select a subset of our dataset to produce a Figure object with Plotly Express.

In [9]:
fig = px.scatter_mapbox(
    data[date_mask], lat="Lat", lon="Long",
    size="Confirmed", size_max=50,
    color="Deaths", color_continuous_scale=px.colors.sequential.Pinkyl,
    hover_name="Province/State",           
    mapbox_style='dark', zoom=1
)

Reading through the parameters, we can see that we've:

  • Passed in the masked subset of our dataset;
  • Specified the latitude and longitude columns of that DataFrame to position our markers;
  • Set the size of our markers to be the number of confirmed cases, with a maximum size of 50;
  • Set the colour of our markers to the number of deaths, on a continuous scale with the colour palette "Pinkyl";
  • Set our on-hover tooltip to be the Province/State value;
  • and set our map style to dark with a zoom of 1.

One extra configuration change we'll make is to remove the colour scale that will appear to the right of the figure. This is just a matter of preference; you can see what you think of it by removing the line below or setting the value to True instead.

In [10]:
fig.layout.coloraxis.showscale = False

All that's left is to display our visualisation.

In [11]:
fig.show()

You can interact with the visualisation by dragging, zooming, hovering, etc.

The Animated Time Series

Let's create the second of our two visualisations. This one will take us on a journey through time, giving us some idea of how the features have changed throughout the duration of the dataset. This time we will be passing in the entire DataFrame instead of the masked one, and we'll use our own active cases feature instead of the number of confirmed cases.

In [12]:
fig = px.scatter_mapbox(
    data, lat="Lat", lon="Long",
    size="Active", size_max=50,
    color="Deaths", color_continuous_scale=px.colors.sequential.Pinkyl,
    hover_name="Province/State",           
    mapbox_style='dark', zoom=1,
    animation_frame="Date", animation_group="Province/State"
)

We can see two additional parameters for this plot, which are used to specify how the animation frames are generated and how they are grouped from frame to frame. In addition to removing the colour scale, we'll also make some changes to customise our animation and the positioning of some of the control elements.

In [13]:
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 200
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 200
fig.layout.coloraxis.showscale = False
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10

Now we can display our visualisation.

In [14]:
fig.show()