Visualising the confirmed cases of COVID-19 in England (15 March 2020)

Preamble

In [23]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to register the custom 'shahin' plot template
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
from IPython.display import display, clear_output

pio.templates['shahin'] = pio.to_templated(
    go.Figure().update_layout(
        legend=dict(orientation="h", y=1.1, x=0.5, xanchor='center'),
        margin=dict(t=0, r=0, b=0, l=0)
    )
).layout.template
pio.templates.default = 'shahin'

Introduction

This section is similar to the previous ones on Visualising the confirmed cases of COVID-19 in England with Scattergeo and Mapbox.

Since publishing those visualisations, it was announced that the original data source would no longer be updated. Instead, the data has been made available at arcgis.com.

Terms of use taken from the data source

No special restrictions or limitations on using the item’s content have been provided.

Visualising the Table

The first step is to read the CSV data into a pandas.DataFrame and display the first five samples.

In [8]:
data = pd.read_csv('https://www.arcgis.com/sharing/rest/content/items/b684319181f94875a6879bbc833ca3a6/data')
data.head()
Out[8]:
GSS_CD GSS_NM TotalCases
0 E09000002 Barking and Dagenham 7
1 E09000003 Barnet 25
2 E08000016 Barnsley 5
3 E06000022 Bath and North East Somerset 2
4 E06000055 Bedford 0

We have the local authority (GSS_NM) and the number of confirmed cases (TotalCases). Let's add two new columns, lat and lon, so that we can store the location information that we'll use to visualise the data.

In [9]:
data["lat"] = np.nan
data["lon"] = np.nan
data.head()
Out[9]:
GSS_CD GSS_NM TotalCases lat lon
0 E09000002 Barking and Dagenham 7 NaN NaN
1 E09000003 Barnet 25 NaN NaN
2 E08000016 Barnsley 5 NaN NaN
3 E06000022 Bath and North East Somerset 2 NaN NaN
4 E06000055 Bedford 0 NaN NaN

We can see the columns are empty. Let's populate these by making requests through GeoPy.

First, we need an instance of the Nominatim (geocoder for OpenStreetMap data) object. We don't want to violate the usage policy, so we'll also pass in a user_agent.

In [10]:
geolocator = Nominatim(user_agent="covid_shahinrostami.com")
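Nominatim's usage policy also caps the request rate at one request per second, so it's polite to throttle our calls. geopy ships a RateLimiter helper for exactly this; the sketch below is a minimal standard-library stand-in, where throttle and min_interval are names of my own choosing rather than geopy API:

```python
import time

def throttle(fn, min_interval=1.0):
    """Wrap fn so successive calls are spaced at least min_interval seconds apart."""
    last_call = [float('-inf')]  # time of the previous call (mutable closure state)

    def wrapped(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)  # pause until the interval has elapsed
        last_call[0] = time.monotonic()
        return fn(*args, **kwargs)

    return wrapped

# e.g. geocode = throttle(geolocator.geocode, min_interval=1.0)
```

Wrapping geolocator.geocode once at the top of the notebook means every later lookup automatically respects the rate limit.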

Let's see if we can get some location data using one of our GSS_NM items. To demonstrate, we'll use the one for my hometown, Bournemouth.

In [11]:
data.GSS_NM[10]
Out[11]:
'Bournemouth, Christchurch and Poole'

This will now be passed into the geocode() method. We'll also append "UK" to the string for disambiguation, e.g. France has a "Bury" too.

In [12]:
location = geolocator.geocode(f"{data.GSS_NM[10]}, UK")
location
Out[12]:
Location(Bournemouth, Bournemouth, Christchurch and Poole, South West England, England, United Kingdom, (50.744673199999994, -1.8579577396350433, 0.0))

It looks like it's returned all the information we need. We'll also need to access the latitude and longitude directly.

In [13]:
print(location.latitude, location.longitude)
50.744673199999994 -1.8579577396350433
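Before looping over the whole dataset, it's worth planning for failures: geocoding calls can time out (we imported GeocoderTimedOut earlier for exactly this reason). A generic retry helper could look like the sketch below; retry_geocode and its parameters are illustrative names of my own, not part of geopy:

```python
def retry_geocode(geocode_fn, query, retries=3, exceptions=(Exception,)):
    """Call geocode_fn(query), retrying up to `retries` times on the given exceptions."""
    for attempt in range(retries):
        try:
            return geocode_fn(query)
        except exceptions:
            if attempt == retries - 1:
                raise  # give up after the final attempt
    return None

# e.g. location = retry_geocode(geolocator.geocode, "Bournemouth, UK",
#                               exceptions=(GeocoderTimedOut,))
```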

Now we need to do this for every local authority in our dataset and fill in the missing lat and lon values.

In [14]:
for index, row in data.iterrows():
    location = geolocator.geocode(f"{row.GSS_NM}, UK",timeout=100)

    data.loc[index,'lat'] = location.latitude 
    data.loc[index,'lon'] = location.longitude

    # None of the following code is required
    # I just wanted a progress bar!
    clear_output(wait = True)
    amount_unloaded = np.floor(((data.shape[0]-index)/data.shape[0])*25).astype(int)
    amount_loaded = np.ceil((index/data.shape[0])*25).astype(int)
    loading = f"Retrieving locations >{'|'*amount_loaded}{'.'*amount_unloaded}<"
    display(loading)

print("Done!")
'Retrieving locations >|||||||||||||||||||||||||<'
Done!
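The progress-bar arithmetic above maps the current row index onto a fixed-width bar of `|` and `.` characters. Isolated as a function (progress_bar is a name of my own choosing, and math stands in for numpy since we only need scalar floor/ceil), it looks like this:

```python
import math

def progress_bar(index, total, width=25):
    """Return the notebook's progress string for row `index` of `total`."""
    unloaded = math.floor(((total - index) / total) * width)  # rows remaining
    loaded = math.ceil((index / total) * width)               # rows completed
    return f"Retrieving locations >{'|' * loaded}{'.' * unloaded}<"
```

Because floor and ceil round in opposite directions, the bar always stays exactly `width` characters wide.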

Now let's put this on the map! We'll go for a bubble plot on a map of the UK, where larger bubbles indicate more confirmed cases.
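The cell below plots the raw case counts as marker sizes. If a few areas dwarf the rest, it can help to rescale the counts first; this sketch maps them linearly so the largest bubble has a fixed diameter (scale_sizes and max_size are hypothetical names, not part of the Plotly API):

```python
def scale_sizes(counts, max_size=40.0):
    """Linearly rescale case counts so the largest marker has diameter max_size."""
    peak = max(counts)
    if peak == 0:
        return [0.0 for _ in counts]  # no cases anywhere: nothing to draw
    return [max_size * c / peak for c in counts]

# e.g. marker size = scale_sizes(data['TotalCases'])
```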

Note

To plot on Mapbox maps with Plotly you will need a Mapbox account and a public Mapbox Access Token. I've removed mine from mapbox_access_token in the cell below.

In [21]:
data['text'] = data['GSS_NM'] + '<br>Confirmed Cases ' + (data['TotalCases']).astype(str)

mapbox_access_token = "your_mapbox_access_token"

fig = go.Figure(go.Scattermapbox(
    lon = data['lon'],
    lat = data['lat'],
        mode = 'markers',
        marker = go.scattermapbox.Marker(
            size = data['TotalCases']/1,  # divide by a larger constant to shrink the markers
            color = 'rgb(180,0,0)',
        ),
        text = data['text'],
    ))

fig.update_layout(
    autosize = True,
    hovermode = 'closest',
    mapbox = dict(
        accesstoken = mapbox_access_token,
        bearing = 0,
        center = {'lat': (data.lat.min() + data.lat.max())/2,
                'lon': (data.lon.min() + data.lon.max())/2},
        pitch = 0,
        zoom = 5,
        style = "basic", # try basic, dark, light, outdoors, or satellite.
    ),
)

fig.show()

It's an interactive plot, so you can hover over it to get more information.

Conclusion

In this notebook, we went on a rather quick journey. We loaded in some data from a hosted CSV file, used a helpful service to populate some location data, and plotted it all on a map using Plotly.

Typed Arrays from String Arrays for Dataset Operations

Preamble

In [2]:
:dep ndarray-csv = {version = "0.4.1"}
:dep ndarray = {version = "0.13.0"}
:dep darn = {version = "0.1.7"}
:dep ureq = {version = "0.11.4"}
:dep plotly = {version = "0.4.0"}
extern crate csv;

use std::io::prelude::*;
use std::fs::*;
use ndarray::prelude::*;
use ndarray_csv::Array2Reader;
use std::str::FromStr;
use plotly::{Plot, Scatter, Layout};
use plotly::common::{Mode, Title};
use plotly::layout::{Axis};

Introduction

In this section, we're going to move from a raw dataset stored in a single string array (ndarray::Array2<String>) to multiple arrays of the desired type. This will enable us to use the appropriate operations for our different types of data. We will demonstrate our approach using the Iris Flower dataset.

Loading our Dataset

Before we move onto moving parts of our Iris Flower dataset into different typed arrays, we need to load it into our raw string array.

In [3]:
let file_name = "Iris.csv";

let res = ureq::get("https://shahinrostami.com/datasets/Iris.csv").call().into_string()?;

let mut file = File::create(file_name)?;
file.write_all(res.as_bytes())?;
let mut rdr = csv::Reader::from_path(file_name)?;
remove_file(file_name)?;

let data: Array2<String> = rdr.deserialize_array2_dynamic().unwrap();
let mut headers : Vec<String> = Vec::new();

for element in rdr.headers()?.into_iter() {
        headers.push(String::from(element));
};

Let's display some rows from the string array to see if it's loaded as expected.

In [4]:
darn::show_frame(&data, Some(&headers));
Out[4]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
"1" "5.1" "3.5" "1.4" "0.2" "Iris-setosa"
"2" "4.9" "3.0" "1.4" "0.2" "Iris-setosa"
"3" "4.7" "3.2" "1.3" "0.2" "Iris-setosa"
"4" "4.6" "3.1" "1.5" "0.2" "Iris-setosa"
"5" "5.0" "3.6" "1.4" "0.2" "Iris-setosa"
... ... ... ... ... ...
"146" "6.7" "3.0" "5.2" "2.3" "Iris-virginica"
"147" "6.3" "2.5" "5.0" "1.9" "Iris-virginica"
"148" "6.5" "3.0" "5.2" "2.0" "Iris-virginica"
"149" "6.2" "3.4" "5.4" "2.3" "Iris-virginica"
"150" "5.9" "3.0" "5.1" "1.8" "Iris-virginica"

Without digging deeper, it looks like we have the correct number of columns and rows.

Raw Dataset Dimensions

Once the data is loaded we may want to determine the number of samples (rows) and features (columns) in our dataset. We can get this information using the shape() function.

In [5]:
&data.shape()
Out[5]:
[150, 6]

We can see that it's returned an array, where the first element indicates the number of rows and the second element indicates the number of columns. If you have prior knowledge of the dataset then this may be a good indicator as to whether your dataset has loaded correctly. This information will be useful later when we're initialising a new array.

Deciding on Data Types

Our dataset is currently loaded into a homogeneous array of strings, ndarray::Array2<String>. This data type has allowed us to load all our data in from a CSV file without prior knowledge of the suitable data types. However, it also means that we cannot apply pertinent operations to each feature. For example, we aren't able to easily determine any central tendencies for the SepalLengthCm column, or convert our units from centimetres to something else. All we can do right now is operate on these values as strings.

Let's see what happens if we try to operate on the data in its current form, e.g. if we want to find the mean average value of each column.

In [6]:
data.mean_axis(Axis(0)).unwrap()
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ the trait `num_traits::identities::Zero` is not implemented for `std::string::String`
the trait bound `std::string::String: num_traits::identities::Zero` is not satisfied
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ the trait `num_traits::cast::FromPrimitive` is not implemented for `std::string::String`
the trait bound `std::string::String: num_traits::cast::FromPrimitive` is not satisfied
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ no implementation for `std::string::String / std::string::String`
cannot divide `std::string::String` by `std::string::String`

As expected, Rust is complaining about our data types. Let's look again at a summary of the dataset and make some decisions about our desired data types.

In [7]:
darn::show_frame(&data, Some(&headers));
Out[7]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
"1" "5.1" "3.5" "1.4" "0.2" "Iris-setosa"
"2" "4.9" "3.0" "1.4" "0.2" "Iris-setosa"
"3" "4.7" "3.2" "1.3" "0.2" "Iris-setosa"
"4" "4.6" "3.1" "1.5" "0.2" "Iris-setosa"
"5" "5.0" "3.6" "1.4" "0.2" "Iris-setosa"
... ... ... ... ... ...
"146" "6.7" "3.0" "5.2" "2.3" "Iris-virginica"
"147" "6.3" "2.5" "5.0" "1.9" "Iris-virginica"
"148" "6.5" "3.0" "5.2" "2.0" "Iris-virginica"
"149" "6.2" "3.4" "5.4" "2.3" "Iris-virginica"
"150" "5.9" "3.0" "5.1" "1.8" "Iris-virginica"

We can see that we have six columns, the names of which we have stored in our headers vector.

In [8]:
&headers
Out[8]:
["Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

Let's go through them one-by-one and decide which data type will support our desired operations.

  • Id. This is an identifier that came from the original CSV file. We don't have much use for this column in our upcoming analyses, so in this case, we're going to drop this column.
  • SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. These columns hold the multivariate data that describes each flower sample: the length and width of the sepals and petals. These are numerical values with fractional parts, so we may want to store them in a floating-point data type, e.g. f32.
  • Species. This column contains the true species of the flower samples. These are categorical values, so we may wish to convert them to numerical (integer) values, e.g. u32, or keep them as strings. We'll continue with the String type for now to keep things simple.
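As an aside, if we later choose to encode Species as integers, a standard-library sketch could look like this; encode_labels is a hypothetical helper of my own (not part of ndarray or darn) that assigns u32 codes in order of first appearance:

```rust
use std::collections::HashMap;

// Encode categorical labels as u32 codes, in order of first appearance.
fn encode_labels(labels: &[&str]) -> Vec<u32> {
    let mut codes: HashMap<String, u32> = HashMap::new();
    labels
        .iter()
        .map(|&label| {
            let next = codes.len() as u32; // next unused code
            *codes.entry(label.to_string()).or_insert(next)
        })
        .collect()
}

fn main() {
    let species = ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"];
    println!("{:?}", encode_labels(&species)); // codes follow first-appearance order
}
```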

Moving Data to Typed Arrays

Once we've decided on what data types we want to employ we can move onto creating our typed arrays. This involves converting values from String to the desired type, and moving our data over to the new and typed arrays.

The Id Column

We've decided that we don't need this column, so it requires no action.

The SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm Columns

We've decided that we want these columns to have a data type of f32, so we need to convert and move them into a new homogeneous array, this time of ndarray::Array2<f32>. We're going to achieve this by using std::str::FromStr which gives us access to the from_str() function that allows us to parse values from strings.
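As a quick standalone check (plain Rust, no ndarray needed): from_str returns a Result, so a malformed value surfaces as an Err rather than a crash, until we choose to unwrap() it.

```rust
use std::str::FromStr;

fn main() {
    // Parsing a well-formed value succeeds...
    let parsed = f32::from_str("5.1").unwrap();
    println!("parsed: {}", parsed);

    // ...while a malformed value yields an Err we could handle explicitly.
    assert!(f32::from_str("five point one").is_err());
}
```

In the cells below we unwrap() for brevity, which is fine here because we've already inspected the data; a production pipeline would want to handle the Err case.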

Let's demonstrate this approach on the first of the four columns, SepalLengthCm. We'll dump the column to an output cell to see the before and after.

In [9]:
data.column(1)
Out[9]:
["5.1", "4.9", "4.7", "4.6", "5.0", "5.4", "4.6", "5.0", "4.4", "4.9", "5.4", "4.8", "4.8", "4.3", "5.8", "5.7", "5.4", "5.1", "5.7", "5.1", "5.4", "5.1", "4.6", "5.1", "4.8", "5.0", "5.0", "5.2", "5.2", "4.7", "4.8", "5.4", "5.2", "5.5", "4.9", "5.0", "5.5", "4.9", "4.4", "5.1", "5.0", "4.5", "4.4", "5.0", "5.1", "4.8", "5.1", "4.6", "5.3", "5.0", "7.0", "6.4", "6.9", "5.5", "6.5", "5.7", "6.3", "4.9", "6.6", "5.2", "5.0", "5.9", "6.0", "6.1", "5.6", "6.7", "5.6", "5.8", "6.2", "5.6", "5.9", "6.1", "6.3", "6.1", "6.4", "6.6", "6.8", "6.7", "6.0", "5.7", "5.5", "5.5", "5.8", "6.0", "5.4", "6.0", "6.7", "6.3", "5.6", "5.5", "5.5", "6.1", "5.8", "5.0", "5.6", "5.7", "5.7", "6.2", "5.1", "5.7", "6.3", "5.8", "7.1", "6.3", "6.5", "7.6", "4.9", "7.3", "6.7", "7.2", "6.5", "6.4", "6.8", "5.7", "5.8", "6.4", "6.5", "7.7", "7.7", "6.0", "6.9", "5.6", "7.7", "6.3", "6.7", "7.2", "6.2", "6.1", "6.4", "7.2", "7.4", "7.9", "6.4", "6.3", "6.1", "7.7", "6.3", "6.4", "6.0", "6.9", "6.7", "6.9", "5.8", "6.8", "6.7", "6.7", "6.3", "6.5", "6.2", "5.9"], shape=[150], strides=[6], layout=Custom (0x0), const ndim=1

It's clear from the output of data.column(1) that every element is a string. Now let's use mapv() to go through every element of data.column(1), parse each value to f32, and return the new array.

In [10]:
data.column(1).mapv(|elem| f32::from_str(&elem).unwrap())
Out[10]:
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5.0, 5.9, 6.0, 6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7, 6.0, 5.7, 5.5, 5.5, 5.8, 6.0, 5.4, 6.0, 6.7, 6.3, 5.6, 5.5, 5.5, 6.1, 5.8, 5.0, 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6.0, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], shape=[150], strides=[1], layout=C | F (0x3), const ndim=1

Looking at the output of our operations, we can see that we were successful in parsing our string values into numerical ones. Let's now use this approach to create a new array named features of type ndarray::Array2<f32>. We'll need to convert our one-dimensional arrays into two-dimensional column arrays using insert_axis(Axis(1)) as we stack them.

In [11]:
let features: Array2::<f32> = 
    ndarray::stack![Axis(1),
        data.column(1)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(2)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(3)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(4)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];

If we don't want to copy and paste the same line of code multiple times (which we don't) we can use a loop instead. First we need to create an array of integers that identify the column indices of the features that we want to convert.

In [12]:
let selected_features = [1, 2, 3, 4];

We can now iterate through the array of column indices, named selected_features, and stack each converted column into a new array of type Array2::<f32>.

In [13]:
let mut features: Array2::<f32> =  Array2::<f32>::zeros((data.shape()[0],0));

for &f in selected_features.iter() {
    features = ndarray::stack![Axis(1), features,
        data.column(f as usize)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];
};

Our headers vector (which describes all six columns) no longer matches our new features array (four columns), so we'll create a new headers vector, feature_headers.

In [14]:
let feature_headers = headers[1..5].to_vec();

We now have our floating-point typed features (SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm) in a single array. Let's see how the data looks in its new form.

In [15]:
darn::show_frame(&features, Some(&feature_headers));
Out[15]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
... ... ... ...
6.7 3.0 5.2 2.3
6.3 2.5 5.0 1.9
6.5 3.0 5.2 2.0
6.2 3.4 5.4 2.3
5.9 3.0 5.1 1.8

We're only seeing 10 samples in the array summary, so let's plot our features to get a better idea.

In [16]:
let layout = Layout::new()
    .xaxis(Axis::new().title(Title::new("Length (cm)")))
    .yaxis(Axis::new().title(Title::new("Width (cm)")));

let sepal = Scatter::new(features.column(0).to_vec(), features.column(1).to_vec())
    .name("Sepal")
    .mode(Mode::Markers);
let petal = Scatter::new(features.column(2).to_vec(), features.column(3).to_vec())
    .name("Petal")
    .mode(Mode::Markers);

let mut plot = Plot::new();

plot.set_layout(layout);
plot.add_trace(sepal);
plot.add_trace(petal);

darn::show_plot(plot);
Out[16]: