Typed Arrays from String Arrays for Dataset Operation

Preamble

In [2]:
:dep ndarray-csv = {version = "0.4.1"}
:dep ndarray = {version = "0.13.0"}
:dep darn = {version = "0.1.15"}
:dep ureq = {version = "0.11.4"}
:dep plotly = {version = "0.4.0"}
extern crate csv;

use std::io::prelude::*;
use std::fs::*;
use ndarray::prelude::*;
use ndarray_csv::Array2Reader;
use std::str::FromStr;
use plotly::{Plot, Scatter, Layout};
use plotly::common::{Mode, Title};
use plotly::layout::{Axis};

Introduction

In this section, we're going to move from a raw dataset stored in a single string array (ndarray::Array2<String>) to multiple arrays of the desired type. This will enable us to use the appropriate operations for our different types of data. We will demonstrate our approach using the Iris Flower dataset.

Loading our Dataset

Before we move parts of our Iris Flower dataset into different typed arrays, we need to load it into our raw string array.

In [3]:
let file_name = "Iris.csv";

let res = ureq::get("https://shahinrostami.com/datasets/Iris.csv").call().into_string()?;

let mut file = File::create(file_name)?;
file.write_all(res.as_bytes())?;
let mut rdr = csv::Reader::from_path(file_name)?;
remove_file(file_name)?;

let data: Array2<String> = rdr.deserialize_array2_dynamic().unwrap();
let mut headers : Vec<String> = Vec::new();

for element in rdr.headers()?.into_iter() {
        headers.push(String::from(element));
};

Let's display some rows from the string array to see if it's loaded as expected.

In [4]:
darn::show_frame(&data, Some(&headers));
Out[4]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
"1" "5.1" "3.5" "1.4" "0.2" "Iris-setosa"
"2" "4.9" "3.0" "1.4" "0.2" "Iris-setosa"
"3" "4.7" "3.2" "1.3" "0.2" "Iris-setosa"
"4" "4.6" "3.1" "1.5" "0.2" "Iris-setosa"
"5" "5.0" "3.6" "1.4" "0.2" "Iris-setosa"
... ... ... ... ... ...
"146" "6.7" "3.0" "5.2" "2.3" "Iris-virginica"
"147" "6.3" "2.5" "5.0" "1.9" "Iris-virginica"
"148" "6.5" "3.0" "5.2" "2.0" "Iris-virginica"
"149" "6.2" "3.4" "5.4" "2.3" "Iris-virginica"
"150" "5.9" "3.0" "5.1" "1.8" "Iris-virginica"

Without digging deeper, it looks like we have the correct number of columns and rows.

Raw Dataset Dimensions

Once the data is loaded we may want to determine the number of samples (rows) and features (columns) in our dataset. We can get this information using the shape() function.

In [5]:
&data.shape()
Out[5]:
[150, 6]

We can see that it's returned an array, where the first element indicates the number of rows and the second element indicates the number of columns. If you have prior knowledge of the dataset then this may be a good indicator as to whether your dataset has loaded correctly. This information will be useful later when we're initialising a new array.
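As a quick illustration (a sketch, not a cell from the original notebook), we can bind these two dimensions to named variables; we'll use the row count in exactly this way later when initialising an empty array to stack columns into.

// Sketch: naming the dimensions returned by shape().
let n_samples = data.shape()[0];  // number of rows (flower samples)
let n_features = data.shape()[1]; // number of columns (features)
println!("{} samples x {} features", n_samples, n_features);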

Deciding on Data Types

Our dataset is currently loaded into a homogeneous array of strings, ndarray::Array2<String>. This data type has allowed us to load all our data in from a CSV file without prior knowledge of the suitable data types. However, it now means that we cannot apply pertinent operations depending on the feature. For example, we aren't able to easily determine any central tendencies for the SepalLengthCm column, or convert our units from centimetres to something else. All we can do right now is operate on these values as strings.

Let's see what happens if we try to operate on the data in its current form, e.g. if we want to find the mean average value of each column.

In [21]:
data.mean_axis(Axis(0)).unwrap()
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ the trait `num_traits::identities::Zero` is not implemented for `std::string::String`
the trait bound `std::string::String: num_traits::identities::Zero` is not satisfied
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ the trait `num_traits::cast::FromPrimitive` is not implemented for `std::string::String`
the trait bound `std::string::String: num_traits::cast::FromPrimitive` is not satisfied
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ no implementation for `std::string::String / std::string::String`
cannot divide `std::string::String` by `std::string::String`

As expected, Rust is complaining about our data types. Let's look again at a summary of the dataset and make some decisions about our desired data types.

In [7]:
darn::show_frame(&data, Some(&headers));
Out[7]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
"1" "5.1" "3.5" "1.4" "0.2" "Iris-setosa"
"2" "4.9" "3.0" "1.4" "0.2" "Iris-setosa"
"3" "4.7" "3.2" "1.3" "0.2" "Iris-setosa"
"4" "4.6" "3.1" "1.5" "0.2" "Iris-setosa"
"5" "5.0" "3.6" "1.4" "0.2" "Iris-setosa"
... ... ... ... ... ...
"146" "6.7" "3.0" "5.2" "2.3" "Iris-virginica"
"147" "6.3" "2.5" "5.0" "1.9" "Iris-virginica"
"148" "6.5" "3.0" "5.2" "2.0" "Iris-virginica"
"149" "6.2" "3.4" "5.4" "2.3" "Iris-virginica"
"150" "5.9" "3.0" "5.1" "1.8" "Iris-virginica"

We can see that we have six columns, the names of which we have stored in our headers vector.

In [8]:
&headers
Out[8]:
["Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

Let's go through them one-by-one and decide which data type will support our desired operations.

  • Id. This is an identifier that came from the original CSV file. We don't have much use for this column in our upcoming analyses, so in this case, we're going to drop this column.
  • SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. This is the multivariate data that describes each flower sample with regards to the length and width of the sepals and petals. These are numerical values with fractional parts, so we may want to store them in a floating-point data type, e.g. f32.
  • Species. This column contains the true species of the flower samples. These are categorical values, so we may wish to convert them to numerical (integer) values, e.g. u32, or keep them as strings. We'll continue with the String type for now to keep things simple, but a sketch of such an integer encoding follows this list.
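For completeness, here's a hypothetical sketch (not part of the notebook's chosen approach) of what an integer encoding of the Species column could look like, mapping each species name to its position in a small lookup array:

// Hypothetical sketch: encode each species string as a u32 category code.
let species_names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"];
let label_codes: Array1<u32> = data.column(5).mapv(|elem| {
    species_names
        .iter()
        .position(|&name| name == elem)
        .expect("unknown species") as u32
});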

Moving Data to Typed Arrays

Once we've decided which data types we want to employ, we can move on to creating our typed arrays. This involves converting values from String to the desired type, and moving our data over to the new, typed arrays.

The Id Column

We've decided that we don't need this column, so it requires no action.

The SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm Columns

We've decided that we want these columns to have a data type of f32, so we need to convert and move them into a new homogeneous array, this time of ndarray::Array2<f32>. We're going to achieve this by using std::str::FromStr which gives us access to the from_str() function that allows us to parse values from strings.

Let's demonstrate this approach on the first of the four columns, SepalLengthCm. We'll dump the column to an output cell to see the before and after.

In [9]:
data.column(1)
Out[9]:
["5.1", "4.9", "4.7", "4.6", "5.0", "5.4", "4.6", "5.0", "4.4", "4.9", "5.4", "4.8", "4.8", "4.3", "5.8", "5.7", "5.4", "5.1", "5.7", "5.1", "5.4", "5.1", "4.6", "5.1", "4.8", "5.0", "5.0", "5.2", "5.2", "4.7", "4.8", "5.4", "5.2", "5.5", "4.9", "5.0", "5.5", "4.9", "4.4", "5.1", "5.0", "4.5", "4.4", "5.0", "5.1", "4.8", "5.1", "4.6", "5.3", "5.0", "7.0", "6.4", "6.9", "5.5", "6.5", "5.7", "6.3", "4.9", "6.6", "5.2", "5.0", "5.9", "6.0", "6.1", "5.6", "6.7", "5.6", "5.8", "6.2", "5.6", "5.9", "6.1", "6.3", "6.1", "6.4", "6.6", "6.8", "6.7", "6.0", "5.7", "5.5", "5.5", "5.8", "6.0", "5.4", "6.0", "6.7", "6.3", "5.6", "5.5", "5.5", "6.1", "5.8", "5.0", "5.6", "5.7", "5.7", "6.2", "5.1", "5.7", "6.3", "5.8", "7.1", "6.3", "6.5", "7.6", "4.9", "7.3", "6.7", "7.2", "6.5", "6.4", "6.8", "5.7", "5.8", "6.4", "6.5", "7.7", "7.7", "6.0", "6.9", "5.6", "7.7", "6.3", "6.7", "7.2", "6.2", "6.1", "6.4", "7.2", "7.4", "7.9", "6.4", "6.3", "6.1", "7.7", "6.3", "6.4", "6.0", "6.9", "6.7", "6.9", "5.8", "6.8", "6.7", "6.7", "6.3", "6.5", "6.2", "5.9"], shape=[150], strides=[6], layout=Custom (0x0), const ndim=1

It's clear from the output of data.column(1) that every element is a string. Now let's use mapv() to go through every element of data.column(1), parse each value to f32, and return the new array.

In [10]:
data.column(1).mapv(|elem| f32::from_str(&elem).unwrap())
Out[10]:
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5.0, 5.9, 6.0, 6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7, 6.0, 5.7, 5.5, 5.5, 5.8, 6.0, 5.4, 6.0, 6.7, 6.3, 5.6, 5.5, 5.5, 6.1, 5.8, 5.0, 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6.0, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], shape=[150], strides=[1], layout=CF (0x3), const ndim=1

Looking at the output of our operations we can see that we were successful in parsing our string values into numerical ones. Let's now use this approach to create a new array named features of type ndarray::Array2<f32>. We'll need to convert our one-dimensional arrays into two-dimensional column arrays using insert_axis(Axis(1)) as we stack them.

In [11]:
let features: Array2::<f32> = 
    ndarray::stack![Axis(1),
        data.column(1)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(2)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(3)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(4)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];

If we don't want to copy and paste the same line of code multiple times (which we don't) we can use a loop instead. First we need to create an array of integers that identify the column indices of the features that we want to convert.

In [12]:
let selected_features = [1, 2, 3, 4];

We can now iterate through the array of column indices, named selected_features, and stack each converted column into a new array of type Array2::<f32>.

In [13]:
let mut features: Array2::<f32> =  Array2::<f32>::zeros((data.shape()[0],0));

for &f in selected_features.iter() {
    features = ndarray::stack![Axis(1), features,
        data.column(f as usize)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];
};

Our headers vector (which has six elements, one for each original column) no longer matches our new features array (four columns), so we'll create a new headers vector, feature_headers.

In [14]:
let feature_headers = headers[1..5].to_vec();

We now have our floating-point typed features (SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm) in a single array. Let's see how the data looks in its new form.

In [15]:
darn::show_frame(&features, Some(&feature_headers));
Out[15]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
... ... ... ...
6.7 3.0 5.2 2.3
6.3 2.5 5.0 1.9
6.5 3.0 5.2 2.0
6.2 3.4 5.4 2.3
5.9 3.0 5.1 1.8
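Now that the features are numeric, element-wise arithmetic works, which is exactly what we couldn't do while everything was a String. As a small sketch (not a cell from the original notebook), converting every measurement from centimetres to millimetres is a single expression:

// Sketch: scalar arithmetic on the typed array, converting cm to mm.
let features_mm: Array2<f32> = &features * 10.0;
darn::show_frame(&features_mm, Some(&feature_headers));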

We're only seeing 10 samples in this array summary, so let's plot our features to get a better idea.

In [23]:
let layout = Layout::new()
    .xaxis(Axis::new().title(Title::new("Length (cm)")))
    .yaxis(Axis::new().title(Title::new("Width (cm)")));

let sepal = Scatter::new(features.column(0).to_vec(), features.column(1).to_vec())
    .name("Sepal")
    .mode(Mode::Markers);
let petal = Scatter::new(features.column(2).to_vec(), features.column(3).to_vec())
    .name("Petal")
    .mode(Mode::Markers);

let mut plot = Plot::new();

plot.set_layout(layout);
plot.add_trace(sepal);
plot.add_trace(petal);

darn::show_plot(plot);
Out[23]: (scatter plot of sepal and petal width (cm) against length (cm))

Let's also check to see that we can now calculate the mean average per column.

In [17]:
features.mean_axis(Axis(0)).unwrap()
Out[17]:
[5.8433347, 3.054, 3.7586665, 1.1986669], shape=[4], strides=[1], layout=CF (0x3), const ndim=1

It appears to be operating as expected. You could validate the results by running the same operation in different software.
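If other software isn't to hand, a lightweight alternative (again, a sketch rather than a cell from the original notebook) is to recompute a single column's mean by hand and compare it against the first element of the output above:

// Sketch: recompute the SepalLengthCm mean without mean_axis() and compare.
let sepal_length = features.column(0);
let manual_mean = sepal_length.sum() / sepal_length.len() as f32;
println!("manual mean = {}", manual_mean); // expect approximately 5.8433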

The Species Column

We've decided to keep these as strings. Let's index the Species column and store it in its own array named labels.

In [18]:
let labels: Array1::<String> = data.column(5).to_owned();

Finally, we'll check the first 10 elements of our new labels array.

In [19]:
labels.slice(s![0..10])
Out[19]:
["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa"], shape=[10], strides=[1], layout=CF (0x3), const ndim=1

We could also use the itertools crate to apply some better presentation to this output.

In [20]:
:dep itertools = {version = "0.9.0"}
extern crate itertools;
use itertools::Itertools;

labels.slice(s![0..10]).iter().format("\n")
Out[20]:
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"

Conclusion

In this section, we've demonstrated how to move parts of our raw string array into multiple arrays of various types. With this approach, we can now start operating on our data using operations appropriate for our analyses.

Visualising the confirmed cases of COVID-19 in England with Mapbox

Preamble

In [ ]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to set shahin plot layout
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
from IPython.display import display, clear_output

pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),margin=dict(t=0,r=0,b=0,l=0))).layout.template
pio.templates.default = 'shahin'

Introduction

This section is similar to the previous section on Visualising the confirmed cases of COVID-19 in England with Scattergeo. However, in this section we will be using more recent data and plotting with Mapbox instead of Scattergeo.

I came across the Table of confirmed cases of COVID-19 in England provided by Public Health England and thought it would be useful to visualise it. I have no doubt a similar visualisation already exists, but I thought it would be an interesting exercise. The data used throughout this notebook was last updated at 9:00 am on 11 March 2020.

Note taken from the data source

Data may be subject to delays in case confirmation and reporting, as well as ongoing data cleaning.

Location is based on case residential postcode. When this is not available, NHS trust or reporting laboratory postcode is used. The data is therefore subject to change.

Counts for Isles of Scilly and City of London are combined with Cornwall and Hackney respectively for disclosure control.

Visualising the Table

The first step was to copy and paste the data from the table into a CSV file, followed by adding two column headings for lat and lon respectively. I have hosted the CSV for convenience and easy reproducibility.

Let's load the data into a pandas.DataFrame and look at the first few records.

In [2]:
data = pd.read_csv('https://shahinrostami.com/datasets/phe_covid_uk_11032020.csv')
data.head()
Out[2]:
local_authority confirmed_cases lat lon
0 Barking and Dagenham 1 NaN NaN
1 Barnet 8 NaN NaN
2 Barnsley 2 NaN NaN
3 Bath and North East Somerset 0 NaN NaN
4 Bedford 0 NaN NaN

We have the local authority (we can consider these to be locations) and the number of confirmed cases. You can also see that our lat and lon columns are empty. Let's populate these by making requests through GeoPy.

First, we need an instance of the Nominatim (geocoder for OpenStreetMap data) object. We don't want to violate the usage policy, so we'll also pass in a user_agent.

In [3]:
geolocator = Nominatim(user_agent="covid_shahinrostami.com")

Let's see if we can get some location data using one of our local_authority items. To demonstrate, we'll use the one for my hometown, Bournemouth.

In [4]:
data.local_authority[10]
Out[4]:
'Bournemouth, Christchurch and Poole'

This will now be passed into the geocode() method. We'll also append "UK" to the string for disambiguation, e.g. France has a "Bury" too.

In [5]:
location = geolocator.geocode(f"{data.local_authority[10]}, UK")
location
Out[5]:
Location(Bournemouth, Bournemouth, Christchurch and Poole, South West England, England, United Kingdom, (50.744673199999994, -1.8579577396350433, 0.0))

It looks like it's returned all the information we need. We'll also need to access the latitude and longitude directly.

In [6]:
print(location.latitude, location.longitude)
50.744673199999994 -1.8579577396350433

Now we need to do this for every local_authority in our dataset and fill in the missing lat and lon values.

In [7]:
for index, row in data.iterrows():
    location = geolocator.geocode(f"{row.local_authority}, UK",timeout=100)

    data.loc[index,'lat'] = location.latitude 
    data.loc[index,'lon'] = location.longitude

    # None of the following code is required
    # I just wanted a progress bar!
    clear_output(wait = True)
    amount_unloaded = np.floor(((data.shape[0]-index)/data.shape[0])*25).astype(int)
    amount_loaded = np.ceil((index/data.shape[0])*25).astype(int)
    loading = f"Retrieving locations >{'|'*amount_loaded}{'.'*amount_unloaded}<"
    display(loading)

print("Done!")
'Retrieving locations >|||||||||||||||||||||||||<'
Done!

Now let's put this on the map! We'll go for a bubble plot on a map of the UK, where larger bubbles indicate more confirmed cases.

Note

To plot on Mapbox maps with Plotly you will need a Mapbox account and a public Mapbox Access Token. I've removed mine from mapbox_access_token in the cell below.
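If you'd prefer not to paste the token into the notebook at all, one option (a sketch, assuming you've exported a hypothetical MAPBOX_ACCESS_TOKEN environment variable) is to read it at runtime:

# Sketch: read the Mapbox token from the environment instead of hardcoding it.
# MAPBOX_ACCESS_TOKEN is a hypothetical variable name; fall back to a placeholder.
import os
mapbox_access_token = os.environ.get("MAPBOX_ACCESS_TOKEN", "your_mapbox_access_token")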

In [8]:
data['text'] = data['local_authority'] + '<br>Confirmed Cases ' + (data['confirmed_cases']).astype(str)

import plotly.graph_objects as go

mapbox_access_token = "your_mapbox_access_token"

fig = go.Figure(go.Scattermapbox(
    lon = data['lon'],
    lat = data['lat'],
        mode = 'markers',
        marker = go.scattermapbox.Marker(
            size = data['confirmed_cases']/.5,
            color = 'rgb(180,0,0)',
        ),
        text = data['text'],
    ))

fig.update_layout(
    autosize = True,
    hovermode = 'closest',
    mapbox = dict(
        accesstoken = mapbox_access_token,
        bearing = 0,
        center = {'lat': (data.lat.min() + data.lat.max())/2,
                'lon': (data.lon.min() + data.lon.max())/2},
        pitch = 0,
        zoom = 5,
        style = "basic", # try basic, dark, light, outdoors, or satellite.
    ),
)

fig.show()