Loading Datasets from CSV into NDArray

Preamble

In [2]:
:dep ndarray-csv = {version = "0.4.1"}
:dep ndarray = {version = "0.13.0"}
:dep darn = {version = "0.1.7"}
:dep ureq = {version = "0.11.4"}
extern crate csv;

use std::io::prelude::*;
use std::fs::*;
use ndarray::prelude::*;
use ndarray_csv::Array2Reader;

Introduction

In this section, we will shift our focus to dealing with real-world datasets and how to load them into an ndarray::Array. Once we have the dataset loaded, we will be interested in some of the high-level characteristics of the dataset, e.g. (but not limited to):

  • How many samples do we have?
  • How many features do we have?
  • What are the most suitable data types for each feature?
  • How many incomplete samples are there (missing values)?
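To make these questions concrete, here is a minimal sketch that answers three of them for a CSV-formatted string using only the standard library. The `CsvSummary` struct and `summarise_csv()` function are hypothetical names introduced for illustration, and the sketch assumes a header row, comma separators, and no quoted fields.

```rust
/// High-level summary of a CSV-formatted string (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct CsvSummary {
    samples: usize,
    features: usize,
    incomplete_samples: usize,
}

/// Naive summary: assumes a header row, comma separators, and no quoted fields.
fn summarise_csv(text: &str) -> CsvSummary {
    // Skip blank lines so a trailing newline is not counted as a sample.
    let mut lines = text.lines().filter(|line| !line.trim().is_empty());
    // The header row tells us how many features (columns) to expect.
    let features = lines.next().map_or(0, |header| header.split(',').count());
    let mut samples = 0;
    let mut incomplete_samples = 0;
    for line in lines {
        samples += 1;
        let fields: Vec<&str> = line.split(',').collect();
        // A sample is incomplete if it has the wrong number of fields
        // or if any field is empty.
        if fields.len() != features || fields.iter().any(|f| f.trim().is_empty()) {
            incomplete_samples += 1;
        }
    }
    CsvSummary { samples, features, incomplete_samples }
}
```

Calling `summarise_csv()` on a downloaded CSV string would give a quick first impression of its size before we commit to a container for it.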

To keep things manageable, we will rely on the famous tabular Iris Flower Dataset, introduced by Ronald Fisher. This is one of the most popular datasets in existence and has been used in many tutorials and examples found in the literature.

The Iris Flower.

Note

The Rust ecosystem isn't completely ready for data analysis, but there are many crates that offer data structures and analysis functionality. I considered a few of these for this book before settling on ndarray. Some of the crates that I considered include (but are not limited to): peroxide, brassfibre, utah, abomination, openml-rust, and rusty-machine. They all have their trade-offs and interesting features.

The Iris Flower Dataset

You can find the dataset within the UCI Machine Learning Repository, and it's also hosted by Kaggle. The multivariate dataset contains 150 samples of the following four real-valued attributes:

  • sepal length,
  • sepal width,
  • petal length,
  • and petal width.

All dimensions are supplied in centimetres. Associated with every sample is also the known classification of the flower:

  • Setosa,
  • Versicolour,
  • or Virginica.

Helpful diagram presenting the 4 attributes and 3 classifications in the Iris dataset.

Typically, this dataset is used to produce a classifier which can determine the classification of the flower when supplied with a sample of the four attributes.

Loading the Iris Flower Dataset into NDarray

In this section, we're going to first cover one of the challenges we may encounter when working with NDarray, and then move on to one approach to loading our dataset into an ndarray::Array. We'll also have a look at some functions that make it convenient to display our dataset whilst we're working with it.

Homogeneous Multidimensional Arrays

NDarray provides a convenient container for our multidimensional data, and it even highlights its support for general elements and numerics, i.e. generic types. Through this approach an ndarray::Array supports the storage of homogeneous data, where every value in the container must be of the same type. These can be floats, integers, strings, and so on. For example, we could store a two-dimensional dataset consisting of only String elements in an ndarray::Array2<String>. However, we wouldn't then be able to have a feature, or column, consisting of unsigned 32-bit integers (u32). If you've worked with the numpy package for Python, you may remember the convenience of its support for homogeneous and heterogeneous data, where support exists for having features of different data types.

It's often the case that we aren't aware of the characteristics of a dataset we want to analyse. Learning about these characteristics will be part of our exploratory analyses, so it's helpful to be able to load all the data into our container without first knowing about the various data types. To work around the limitation of a homogeneous container, we will load all of our data into an ndarray::Array of strings for the initial analyses. Based on our findings, we can formulate a strategy for how to load in the data for the more pertinent operations.
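As a rough illustration of how findings from an all-string container could inform that strategy, the sketch below suggests a type for a column of String values by attempting to parse every value. The `ColumnType` enum and `infer_column_type()` function are hypothetical names used only for this example. Note that an empty column is reported as `UnsignedInt`, and that Rust's f64 parser also accepts values such as "NaN" and "inf".

```rust
/// Candidate types for a column, ordered from most to least specific.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ColumnType {
    UnsignedInt,
    Float,
    Text,
}

/// Suggest the most specific candidate type that every value parses as.
fn infer_column_type(values: &[String]) -> ColumnType {
    if values.iter().all(|v| v.parse::<u32>().is_ok()) {
        ColumnType::UnsignedInt
    } else if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        ColumnType::Float
    } else {
        // Anything that is not uniformly numeric stays as text.
        ColumnType::Text
    }
}
```

Applied to each column of a string array in turn, a helper like this would tell us which features could later be stored in a numeric array.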

From URL to NDarray Array

Let's demonstrate one complete step-by-step process for downloading a CSV dataset from the web and loading it into an NDarray, all without leaving our Rust notebook.

If we're going to be downloading a file from the web we will likely be storing it to disk and using it from time-to-time. In these cases, it makes sense to store the file name or path in a variable.

In [3]:
let file_name = "Iris.csv";

Next, we're going to use the ureq crate to retrieve the contents of the CSV dataset into a String. The ureq crate is a minimal library that supports most of the standard request methods, e.g. GET, POST, PUT, etc. There are many alternative crates, such as the more popular reqwest crate; however, I made my selection based on which had the fewest dependencies and the least overhead.

For convenience, I have mirrored the Iris Flower dataset at https://shahinrostami.com/datasets/Iris.csv.

In [4]:
let res = ureq::get("https://shahinrostami.com/datasets/Iris.csv").call().into_string()?;

Now we have a string containing our entire CSV dataset. At this point, we may wish to dump the entire String into an output cell to see what it looks like. However, our dataset may be exceptionally large and dumping the entire string may be inconvenient and unhelpful. Of course, you have the option of opening this CSV in another application, but we may want to keep the entire process contained within this notebook so we have a reproducible and explained process.

So, let's gain an understanding of the current situation so that we can decide what to do next. First, we'll find out the length of our String.

In [5]:
res.len()
Out[5]:
5107

In this case, we have a String that is over $5000$ bytes long, which for plain ASCII data like this means over $5000$ characters. Dumping all these characters to an output cell will negatively impact the presentation of our notebook without adding any real value.
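Since String::len() counts bytes rather than characters, the two figures can differ when the text contains non-ASCII characters. A small sketch of the difference follows; the `byte_and_char_counts()` helper is a hypothetical name introduced for illustration.

```rust
/// Return (bytes, characters) for a string slice.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    // `len()` is the size in bytes; `chars().count()` walks the
    // Unicode scalar values one by one.
    (text.len(), text.chars().count())
}
```

For an ASCII-only CSV such as ours, both counts are the same, so `res.len()` is a fair proxy for the number of characters.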

Let's truncate the dataset so we can have a peek at the samples. We'll pick an arbitrary desired length of $500$. This will also tell us whether the CSV dataset has an initial row indicating the column names.

In [6]:
println!("{}", &res[..500]);
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
6,5.4,3.9,1.7,0.4,Iris-setosa
7,4.6,3.4,1.4,0.3,Iris-setosa
8,5.0,3.4,1.5,0.2,Iris-setosa
9,4.4,2.9,1.4,0.2,Iris-setosa
10,4.9,3.1,1.5,0.1,Iris-setosa
11,5.4,3.7,1.5,0.2,Iris-setosa
12,4.8,3.4,1.6,0.2,Iris-setosa
13,4.8,3.0,1.4,0.1,Iris-setosa
14,4.3,3.0,1.1,0.1,Iris-setosa
15,5.8,4.0

From the output we can see that what we've retrieved is indeed a CSV dataset. We can also see that the dataset has $6$ elements per row, and that it has indeed come with an initial row to indicate the column names.
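One caveat with slicing a string by a fixed byte index, as in `&res[..500]`, is that it panics if the string is shorter than the index or if the index falls inside a multi-byte character. For other datasets, a more defensive preview could look like the sketch below. The `preview()` helper is a hypothetical name, not part of any crate used here.

```rust
/// Return at most `max_bytes` of `text`, backing off to the nearest
/// character boundary so the slice never panics.
fn preview(text: &str, max_bytes: usize) -> &str {
    if max_bytes >= text.len() {
        return text;
    }
    let mut end = max_bytes;
    // Byte index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}
```

With this in place, `println!("{}", preview(&res, 500));` would behave like the cell above for our ASCII dataset, and would still work on shorter or non-ASCII strings.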

Let's save this file locally before moving on in case we wish to load it multiple times or work offline.

In [7]:
let mut file = File::create(file_name)?;
file.write_all(res.as_bytes())?;

Moving forward we'll work with this local file.
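If we expect to re-run the notebook or work offline, one option is to download the file only when a local copy doesn't already exist. The sketch below shows the idea using only the standard library. The `load_or_fetch()` function is a hypothetical helper, and the `fetch` closure stands in for whatever performs the download.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read `path` if it exists; otherwise call `fetch`, save the result
/// to `path`, and return it.
fn load_or_fetch<F>(path: &Path, fetch: F) -> io::Result<String>
where
    F: FnOnce() -> io::Result<String>,
{
    if path.exists() {
        // A local copy is available, so no download is needed.
        return fs::read_to_string(path);
    }
    let contents = fetch()?;
    // Propagate any write error instead of silently ignoring it.
    fs::write(path, &contents)?;
    Ok(contents)
}
```

Assuming the download call returns an io::Result<String>, as the ureq call earlier in this section appears to, the closure passed as `fetch` could simply wrap that call.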

We'll use the csv crate, a fast and flexible CSV reader and writer for Rust, to load the CSV data into a Reader. The CSV reader automatically treats the first row as the header, but this is configurable with ReaderBuilder::has_headers. In our case the default value is fine.

In [8]:
let mut rdr = csv::Reader::from_path(file_name)?;

We mentioned that we may want to keep our local file around for future usage. However, if you do wish to delete it using the notebook you can use std::fs::remove_file().

In [9]:
remove_file(file_name)?;

Now we'll use the ndarray-csv crate to deserialize our CSV data into a homogeneous array (Array2<String>). We are using the deserialize_array2_dynamic() function, which does not require us to specify the number of rows or columns.

In [10]:
let data: Array2<String> = rdr.deserialize_array2_dynamic()?;

Let's output our dataset stored in data to see if it looks as expected. We will replicate the convenient behaviour of the DataFrame from pandas which presents an HTML formatted table. Previously we used our own function from the darn crate, darn::show_array(), but I have extended this into a new function, darn::show_frame(), to support headers too. In this case, we don't have easy access to our headers just yet, so we will pass in None as our second parameter.

In [11]:
darn::show_frame(&data, None);
Out[11]:
"1" "5.1" "3.5" "1.4" "0.2" "Iris-setosa"
"2" "4.9" "3.0" "1.4" "0.2" "Iris-setosa"
"3" "4.7" "3.2" "1.3" "0.2" "Iris-setosa"
"4" "4.6" "3.1" "1.5" "0.2" "Iris-setosa"
"5" "5.0" "3.6" "1.4" "0.2" "Iris-setosa"
... ... ... ... ... ...
"146" "6.7" "3.0" "5.2" "2.3" "Iris-virginica"
"147" "6.3" "2.5" "5.0" "1.9" "Iris-virginica"
"148" "6.5" "3.0" "5.2" "2.0" "Iris-virginica"
"149" "6.2" "3.4" "5.4" "2.3" "Iris-virginica"
"150" "5.9" "3.0" "5.1" "1.8" "Iris-virginica"

You can see that, unlike darn::show_array(), our new darn::show_frame() function does not dump every sample to the output. Instead, it mimics the pandas approach where only the first and last five samples are presented, with a row of ellipses in-between.
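The row selection behind this kind of truncated display can be sketched in a few lines. This is not the implementation used by darn::show_frame(), only an illustration of the idea. The `head_tail()` function is a hypothetical name, and each row is represented as an already formatted String.

```rust
/// Pick the rows to display: the first and last `n`, with "..."
/// marking the gap. Short inputs are returned unchanged.
fn head_tail(rows: &[String], n: usize) -> Vec<String> {
    if rows.len() <= 2 * n {
        return rows.to_vec();
    }
    let mut shown = rows[..n].to_vec();
    shown.push(String::from("..."));
    shown.extend_from_slice(&rows[rows.len() - n..]);
    shown
}
```

For a dataset with 150 rows and `n` set to five, this yields eleven rows: five from the start, one placeholder, and five from the end.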

Now let's extract our headers so we can access them easily and use them in our darn::show_frame() function too. First, we'll create a new Vector of String elements to store them.

In [12]:
let mut headers : Vec<String> = Vec::new();

Next, we'll iterate through every element in the csv::StringRecord that stores our header row and push each one to our new vector.

In [13]:
for element in rdr.headers()?.into_iter() {
    headers.push(String::from(element));
};

Another approach for getting our csv::StringRecord header elements into our vector is to use indices paired with the get() method:

for i in 0..rdr.headers()?.len(){
    headers.push(String::from(rdr.headers()?.get(i).unwrap()));
};
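A third, more iterator-oriented approach is to map each field to an owned String and collect the result, which avoids the mutable vector and the explicit loop. The sketch below is written against any iterator of string slices so that it stands alone. The `collect_headers()` function is a hypothetical helper.

```rust
/// Collect borrowed header fields into an owned vector of strings.
fn collect_headers<'a, I>(fields: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    fields.into_iter().map(String::from).collect()
}
```

Assuming the header record's iterator yields string slices, which the loop in the earlier cell suggests, a call such as `collect_headers(rdr.headers()?.iter())` should produce the same vector.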

Let's dump our headers vector to an output cell to make sure it looks as it should.

In [14]:
headers
Out[14]:
["Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

This should look as we expected, and we can confirm by comparing it to the raw CSV String output from earlier. Let's use the darn::show_frame() function again, but this time we can also provide the headers.

In [16]:
darn::show_frame(&data, Some(&headers));
Out[16]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
"1" "5.1" "3.5" "1.4" "0.2" "Iris-setosa"
"2" "4.9" "3.0" "1.4" "0.2" "Iris-setosa"
"3" "4.7" "3.2" "1.3" "0.2" "Iris-setosa"
"4" "4.6" "3.1" "1.5" "0.2" "Iris-setosa"
"5" "5.0" "3.6" "1.4" "0.2" "Iris-setosa"
... ... ... ... ... ...
"146" "6.7" "3.0" "5.2" "2.3" "Iris-virginica"
"147" "6.3" "2.5" "5.0" "1.9" "Iris-virginica"
"148" "6.5" "3.0" "5.2" "2.0" "Iris-virginica"
"149" "6.2" "3.4" "5.4" "2.3" "Iris-virginica"
"150" "5.9" "3.0" "5.1" "1.8" "Iris-virginica"

Looking good! Moving forward, we want to start interrogating our dataset to learn its characteristics. We may use what we learn to change our container approach so that it's more suitable for the data types and our desired operations.

All Together

Before concluding, let's put everything together into a compact cell.

In [19]:
let file_name = "Iris.csv";

let res = ureq::get("https://shahinrostami.com/datasets/Iris.csv").call().into_string()?;

let mut file = File::create(file_name)?;
file.write_all(res.as_bytes())?;
let mut rdr = csv::Reader::from_path(file_name)?;
remove_file(file_name)?;

let data: Array2<String> = rdr.deserialize_array2_dynamic()?;
let mut headers : Vec<String> = Vec::new();

for element in rdr.headers()?.into_iter() {
    headers.push(String::from(element));
};

We'll also use darn::show_frame() again to display some of the samples and the headers.

In [20]:
darn::show_frame(&data, Some(&headers));

Conclusion

In this section, we've demonstrated how to get a dataset from the web into an ndarray::Array. We downloaded our CSV file, loaded it using a CSV reader, deserialised it into a homogeneous array, extracted the headers into a vector, and then presented some of the samples using an HTML table. In the next section, we'll start interrogating the dataset to learn more about the samples and features.

Support this work

You can access this notebook and more by getting the e-book on Data Analysis with Rust Notebooks.