# Unique Array Elements and their Frequency

## Preamble

In [2]:
:dep ndarray-csv = {version = "0.4.1"}
:dep ndarray = {version = "0.13.0"}
:dep darn = {version = "0.1.7"}
:dep ureq = {version = "0.11.4"}
:dep itertools = {version = "0.9.0"}
:dep plotly = {version = "0.4.0"}
extern crate csv;
extern crate itertools;

use std::io::prelude::*;
use std::fs::*;
use ndarray::prelude::*;
use ndarray_csv::Array2Reader;
use std::str::FromStr;
use itertools::Itertools;
use plotly::{Plot, Bar, Layout};
use plotly::common::{Mode, Title};
use plotly::layout::{Axis};
use std::collections::HashMap;


## Introduction

In this section, we're going to take a look at some approaches to determining the unique elements in a column of data and their frequency. This is a common task which we may often apply to categorical data, and it can easily give us a good idea of the balance of categorical values across our dataset.

If you're familiar with Python and NumPy, you may have encountered the numpy.unique() function, which returns the sorted unique elements of an array. In more recent versions of NumPy, numpy.unique() also takes a parameter named return_counts, which when set to True also returns the number of times each unique element appears in the array.

Let's see how we can do the same for our ndarray::Array2.

We will continue using the Iris Flower dataset, so we need to load it into our raw string array first.

In [3]:
let file_name = "Iris.csv";

let res = ureq::get("https://shahinrostami.com/datasets/Iris.csv").call().into_string()?;

let mut file = File::create(file_name)?;
file.write_all(res.as_bytes())?;
let mut rdr = csv::Reader::from_path(file_name)?;
remove_file(file_name)?;

let data: Array2<String>= rdr.deserialize_array2_dynamic().unwrap();
let mut headers : Vec<String> = Vec::new();

for element in rdr.headers()?.into_iter() {
    headers.push(String::from(element));
};


### Moving Data to Typed Arrays

We need to convert the feature columns from String to f32 and move them into a typed array. The species labels stay as strings in their own array.

In [4]:
let mut features: Array2::<f32> = Array2::<f32>::zeros((data.shape()[0], 0));

for &f in [1, 2, 3, 4].iter() {
    features = ndarray::stack![Axis(1), features,
        data.column(f as usize)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];
};

let labels: Array1::<String> = data.column(5).to_owned();
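
The conversion above calls unwrap(), which panics on the first cell that is not a valid number. As a minimal sketch using only the standard library, the same parsing step can return an error instead. The helper name parse_column is hypothetical and is not part of ndarray or ndarray-csv.

```rust
use std::num::ParseFloatError;
use std::str::FromStr;

/// Hypothetical helper: parses every cell as f32 and returns the first
/// parse error instead of panicking.
fn parse_column(cells: &[&str]) -> Result<Vec<f32>, ParseFloatError> {
    // Collecting into a Result stops at the first Err.
    cells.iter().map(|cell| f32::from_str(cell.trim())).collect()
}

fn main() {
    // A well-formed column parses into a Vec<f32>.
    println!("{:?}", parse_column(&["5.1", "3.5", "1.4"]));
    // A malformed cell yields an Err rather than a panic.
    println!("{}", parse_column(&["5.1", "n/a"]).is_err());
}
```

Whether to panic or propagate the error is a design choice: for a known-clean dataset such as Iris, unwrap() keeps the notebook short, while a Result is safer for data of unknown quality.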


We will only be using our species labels, stored in labels, throughout the rest of this section.

## Unique Elements

We can get the unique elements in our array using the unique() function provided by the itertools crate. The itertools documentation describes it as follows:

> Return an iterator adaptor that filters out elements that have already been produced once during the iteration. Duplicates are detected using hash and equality.

We can then iterate through this and output the elements to the output cell.

In [5]:
for i in labels.iter().unique() {
    println!("{}", i);
};

Iris-setosa
Iris-versicolor
Iris-virginica


We can also use the .format() function on .unique() to print out our unique elements.

In [6]:
labels.iter().unique().format("\n")

Out[6]:
"Iris-setosa"
"Iris-versicolor"
"Iris-virginica"

We may also want to store these in a vector for later use. Let's store them in unique_elements.

In [7]:
let unique_elements = labels.iter().cloned().unique().collect_vec();

unique_elements

Out[7]:
["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
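
To make the "hash and equality" description of unique() more concrete, here is a minimal sketch of the same idea using only the standard library. It is not the itertools implementation, and the helper name unique_in_order is hypothetical. It keeps the first occurrence of each element, in the order the elements first appear.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper: keeps the first occurrence of each element,
/// preserving the order in which elements first appear.
fn unique_in_order<T: Eq + Hash + Clone>(items: &[T]) -> Vec<T> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for item in items {
        // `insert` returns false when the value was already in the set.
        if seen.insert(item.clone()) {
            out.push(item.clone());
        }
    }
    out
}

fn main() {
    let labels = vec!["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"];
    // prints ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
    println!("{:?}", unique_in_order(&labels));
}
```

Note that the result follows first-seen order and is not sorted, unlike numpy.unique().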

## Count of Unique Elements

In this case it's easy to see that we have three unique elements, but if the list is too long to count by eye, or we want to store the value for later use, we can call the len() function on the vector containing the unique elements.

In [8]:
unique_elements.len()

Out[8]:
3

We could also get the count from our iterator.

In [9]:
labels.iter().unique().count()

Out[9]:
3

## Frequency of Unique Elements

To find the frequency of a specific string we can use filter() and then count().

In [10]:
labels.iter().filter(|&elem| *elem == "Iris-virginica").count()

Out[10]:
50

We may also want to find the frequency of every unique element and store it in a vector. We'll call this unique_frequency.

In [11]:
let mut unique_frequency = Vec::<usize>::new();


We can populate this by iterating through the unique elements and using each string to find its frequency with the filter() approach above.

In [12]:
for unique_elem in unique_elements.iter() {
    unique_frequency.push(labels.iter().filter(|&elem| elem == unique_elem).count());
};

unique_frequency

Out[12]:
[50, 50, 50]
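
The filter() approach above scans the labels once per unique element. As an alternative sketch using only the standard library, a BTreeMap can count every label in a single pass and return the unique values in sorted order, similar in spirit to numpy.unique() with return_counts. The helper name unique_counts is hypothetical, and it takes a plain slice of String rather than an ndarray array to keep the example self-contained.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: returns (sorted unique values, their counts),
/// counting every element in a single pass.
fn unique_counts(items: &[String]) -> (Vec<String>, Vec<usize>) {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for item in items {
        // Start each new key at 0, then increment.
        *counts.entry(item.as_str()).or_insert(0) += 1;
    }
    // BTreeMap iterates in key order, so the unique values come out sorted
    // and the counts line up with them index by index.
    let values = counts.keys().map(|k| k.to_string()).collect();
    let freqs = counts.values().copied().collect();
    (values, freqs)
}

fn main() {
    let labels: Vec<String> = ["b", "a", "b", "c", "b"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let (values, freqs) = unique_counts(&labels);
    // prints ["a", "b", "c"] [1, 3, 1]
    println!("{:?} {:?}", values, freqs);
}
```

For three classes the difference is negligible, but the single pass scales better when a column has many distinct values.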

## Visualise the Frequency of Unique Elements

It's useful to visualise the frequency of elements in our array, especially if we're interested in looking at the balance in our dataset. We can do this with a Bar plot using Plotly.

In [13]:
let layout = Layout::new()
    .yaxis(Axis::new().title(Title::new("Frequency")));

// One bar per unique label, with its frequency as the bar height.
let freq_bars = Bar::new(unique_elements, unique_frequency);

let mut plot = Plot::new();

plot.set_layout(layout);
plot.add_trace(freq_bars);