Class Imbalance and Oversampling

In this article we're going to introduce the problem of dataset class imbalance, which often occurs in real-world classification problems. We'll then look at oversampling as a possible solution and demonstrate it with a coded example on an imbalanced dataset.

Class imbalance

Let's assume we have a dataset where the data points are classified into two categories: Class A and Class B. In an ideal scenario the division of the data point classifications would be equal between the two categories, e.g.:

  • Class A accounts for 50% of the dataset.
  • Class B accounts for the other 50% of the dataset.

With the above scenario we could sufficiently measure the performance of a classification model using classification accuracy.

Unfortunately, this ideal balance often isn't the case when working with real-world problems, where the categories of interest may occur far less often. For context, here are some examples:

  • Credit Card Fraud: The majority of credit card transactions are genuine, whereas the minority of credit card transactions are fraudulent.
  • Medical Scans: The majority of medical scans are normal, whereas the minority of medical scans indicate something pathological.
  • Weapons Detection: The majority of body scans are normal, whereas the minority of body scans detect a concealed weapon.

An imbalanced dataset could consist of data points divided as follows:

  • Class A accounts for 90% of the dataset.
  • Class B accounts for 10% of the dataset.

Let's say in this case that Class B represents the suspect categories, e.g. a weapon/disease/fraud has been detected. If a model scored a classification accuracy of 90%, we may decide we're happy. After all, the model appears to be correct 90% of the time.

However, this measurement is misleading when dealing with an imbalanced dataset. Another way to look at it: I could write a trivial function that classifies everything as Class A and still achieve a classification accuracy of 90% when tested against this imbalanced dataset.

def classify_data(data):
    # ignore the input entirely and always predict the majority class
    return "Class A"

Unfortunately, my solution is useless for detecting anything meaningful in the real-world.

This is a problem.

Oversampling

One solution to this problem is to use a sampling technique to either:

  • Oversample - this will create new synthetic samples that simulate the minority class to balance the dataset.
  • Undersample - this will remove samples from the majority class according to some scheme to balance the dataset.

For this article we will focus on oversampling to create a balanced training set for a machine learning algorithm. Because this involves creating synthetic samples, it is important not to include these in the test set. Testing of a model must rely entirely on the real data.
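
As a rough sketch of that workflow, assuming hypothetical feature and label arrays X and y, and using imbalanced-learn's ADASYN method (introduced below; fit_resample is its method name in recent releases):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN

# split first, so the held-out test set contains only real samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

# oversample the training set only; the test set is left untouched
X_train_resampled, y_train_resampled = ADASYN().fit_resample(X_train, y_train)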

It's also important to note that oversampling is not always a suitable solution to the imbalanced dataset problem. This depends entirely on factors such as the characteristics of the dataset, the problem domain, etc.

Coded example

Let's demonstrate the oversampling approach using a dataset and some Python libraries. We will be employing the imbalanced-learn package which contains many oversampling and under-sampling methods. A handy feature is its great compatibility with scikit-learn. Specifically, we will be using the Adaptive Synthetic (ADASYN) over-sampling method based on the publication below, but other popular methods, e.g. the Synthetic Minority Oversampling Technique (SMOTE), may work just as well.

He, Haibo, Yang Bai, Edwardo A. Garcia, and Shutao Li. “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” In IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322-1328, 2008.

First we begin by importing our packages. We have the usual suspects, numpy, matplotlib, and scikit-learn, with the addition of the new package which contains implementations of sampling methods: imblearn. Depending on your environment, e.g. Kaggle Kernels, Anaconda, or various Docker images, you may need to install imblearn before you can import it.
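
In a Jupyter or Kaggle notebook, a one-off install cell along these lines usually does the trick; note that the PyPI package is named imbalanced-learn, even though it is imported as imblearn:

!pip install imbalanced-learn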

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical visualisations

from sklearn.decomposition import PCA
from imblearn.over_sampling import ADASYN

Moving on, we need to load our dataset. For this example, we have a CSV file, dataset.csv, which contains our input variables and the desired classification labels.

# load the dataset
data = pd.read_csv("dataset.csv")

A quick invocation of the head() function gives us some idea about the form of the data: input variables a to o, with the final column labelled class.
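
For those following along, the call itself is just:

data.head()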

Confirming the balance of the dataset

Before we decide if the dataset needs oversampling, we need to investigate the current balance of the samples according to their classification. Depending on the size and complexity of your dataset, you could get away with simply outputting the classification labels and observing the balance.
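
For example, with the label column named class (as we saw from head()), this is as simple as:

data['class']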

The output of the above tells us that there is certainly an imbalance in the dataset, where the majority class, 1, significantly outnumbers the minority class, 0. To be sure of this, we can have a closer look using value_counts().

data['class'].value_counts()

The output of which will be:

As you can see, value_counts() has listed the number of instances per class, and it appears to be exactly what we were expecting. With the knowledge that one class consists of 47 samples and the other consists of 457 samples, it is clear this is an imbalanced dataset. Let's visualise this before moving on. We're going to use Principal Component Analysis (PCA) through sklearn.decomposition.PCA() to reduce the dimensionality of our input variables for easier visualisation.

pca = PCA(n_components=2)
data_2d = pd.DataFrame(pca.fit_transform(data.iloc[:,0:16]))

The output of which will look something like:

The final column containing the classifications has been omitted from the transformation using [pandas.DataFrame.iloc](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html). After the transformation we will add the classification label column back to the DataFrame for use in the visualisations coming later. We will also name our columns for easy reference:

data_2d = pd.concat([data_2d, data['class']], axis=1)
data_2d.columns = ['x', 'y', 'class']

This can be confirmed by outputting the DataFrame again.
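
A quick head() call will do:

data_2d.head()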

With our DataFrame in the desirable form, we can create a quick scatterplot visualisation which again confirms the imbalance of the dataset.
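
One way to produce such a plot, sketched here with seaborn's scatterplot() (available in seaborn 0.9 and later):

sns.scatterplot(x='x', y='y', hue='class', data=data_2d)
plt.show()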

ADASYN for oversampling

Using ADASYN through imblearn.over_sampling is straightforward. An ADASYN object is instantiated, and then the fit_resample() method (named fit_sample() in older imbalanced-learn releases) is invoked with the input variables and output classifications as the parameters:

ada = ADASYN()
X_resampled, y_resampled = ada.fit_resample(data.iloc[:,0:16], data['class'])

The oversampled input variables have been stored in X_resampled and their corresponding output classifications have been stored in y_resampled. Once again, we're going to restore our data into the DataFrame form for easy interrogation and visualisation:

data_oversampled = pd.concat([pd.DataFrame(X_resampled), pd.DataFrame(y_resampled)], axis=1)
data_oversampled.columns = data.columns

Using value_counts() we can have a look at the new balance.
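
As before, we call it on the class column of the new DataFrame:

data_oversampled['class'].value_counts()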

Now we have our oversampled and more balanced dataset. Let's visualise this on a scatterplot using our earlier approach.

data_2d_oversampled = pd.DataFrame(pca.transform(data_oversampled.iloc[:,0:16]))
data_2d_oversampled = pd.concat([data_2d_oversampled, data_oversampled['class']], axis=1)
data_2d_oversampled.columns = ['x', 'y', 'class']

Similar to last time, we've used PCA to reduce the dimensionality of our newly oversampled dataset for easier visualisation, and we've restored the data into a DataFrame with the desired column names. If we plot this data, we can see there is no longer a significant majority class.
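
Using the same seaborn sketch as before:

sns.scatterplot(x='x', y='y', hue='class', data=data_2d_oversampled)
plt.show()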

Conclusion

In this article we've had a quick look at the problem of imbalanced datasets and suggested one approach to the problem through oversampling. We've implemented a coded example which applied ADASYN to an imbalanced dataset and visualised the difference before and after. If you plan to use this approach in practice, don't forget to first split your data into the training and testing sets before applying oversampling techniques to the training set only.

Scatter plots of the original and oversampled datasets.

Part 4 - Results Analysis

Overview

This will be the fourth article in a four-part series covering the following:

  1. Dataset analysis - We will present and discuss a dataset selected for our machine learning experiment. This will include some analysis and visualisations to give us a better understanding of what we're dealing with.

  2. Experimental design - Before we conduct our experiment, we need to have a clear idea of what we're doing. It's important to know what we're looking for, how we're going to use our dataset, what algorithms we will be employing, and how we will determine whether the performance of our approach is successful.

  3. Implementation - We will use the Keras API on top of TensorFlow to implement our experiment. All code will be in Python, and at the time of publishing everything is guaranteed to work within a Kaggle Notebook.

  4. Results - Supported by figures and statistics, we will have a look at how our solution performed and discuss anything interesting about the results.

Results

In the last article we prepared our dataset such that it was ready to be fed into our neural network training and testing process. We then built and trained our neural network models using Python and Keras, followed by some simple automation to generate thirty samples per arm of our experiment. Now, we'll have a look at how our solutions performed and discuss anything interesting about the results. This will include some visualisation, and we may even return to our experiment code to produce some new results.

Let's remind ourselves of our testable hypothesis:

Hypothesis : A neural network classifier's performance on the Iris Flower dataset is affected by the number of hidden layer neurons.

When we test our hypothesis, there are two possible outcomes:

  • $H_0$ the null hypothesis: insufficient evidence to support the hypothesis.

  • $H_1$ the alternate hypothesis: evidence suggests the hypothesis is likely true.

Strictly speaking, our experiments will not allow us to fully settle this question. Our experimental arm uses the same structure as the control arm except for one variable: the number of neurons on the hidden layer changes from four to five. Therefore, we are only testing whether this particular change affects the performance of the neural network classifier.

Loading the results

Similar to the last three parts of this series, we will be using a Kaggle Kernel notebook as our coding environment. If you saved your results to files using a Kaggle Notebook, then you will need to load those files into your draft environment as a data source. It's not immediately obvious where the files have been stored, but you can locate them by repeating the following steps:

Once you have the data in your environment, use the following code to load the data into variables. You will need to adjust the parameters for read_csv() to match your filenames.

Note

Below you will see iris-flower-dataset-classifier-comparison used in the pathnames; be sure to use the correct pathnames for your own experiment.

results_control_accuracy = pd.read_csv("/kaggle/input/iris-flower-dataset-classifier-comparison/results_control_accuracy.csv")

results_experimental_accuracy = pd.read_csv("/kaggle/input/iris-flower-dataset-classifier-comparison/results_experimental_accuracy.csv")

If you don't have access to the results generated from the previous article, then you are welcome to use my results with the following code:

results_control_accuracy = pd.DataFrame([0.9333333359824286, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.6000000052981906, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9111111124356588, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9111111124356588])

results_experimental_accuracy = pd.DataFrame([0.9111111124356588, 0.9555555568801032, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.933333334657881, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.933333334657881, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9333333359824286, 0.9777777791023254, 0.9777777791023254, 0.9333333359824286, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254])

Basic stats

With our results loaded, we can get some quick stats to start comparing the performance differences between the two arms of our experiment.

We can start by comparing the mean performance of the control arm against the experimental arm. Using pandas and numpy, we can write the following code:

mean_control_accuracy = results_control_accuracy.mean()
print("Mean Control Accuracy: {}".format(mean_control_accuracy))

mean_experimental_accuracy = results_experimental_accuracy.mean()
print("Mean Experimental Accuracy: {}".format(mean_experimental_accuracy))

The output of which will be the following if you've used the data provided above:

At this point, it may be tempting to claim that the results generated by the experimental arm of our experiment have outperformed those of the control arm. Whilst it's true that the mean accuracy of the 30 samples from the experimental arm is higher than that of the control arm, we are not yet certain of the significance of this result. The difference in performance could have occurred simply by chance, and if we generated another set of 30 samples the results could be the other way around.

Before moving on, it may also be useful to report the standard deviation of the results in each arm of the experiment:

std_control_accuracy = results_control_accuracy.std()
print("Standard Deviation of Control Accuracy Results: {}".format(std_control_accuracy))

std_experimental_accuracy = results_experimental_accuracy.std()
print("Standard Deviation of Experimental Accuracy Results: {}".format(std_experimental_accuracy))

The output of which will be the following if you've used the data provided above:

Visualising the results

Moving on to visualisations, one common plot used to compare this type of data is the box plot, which we can produce using pandas.DataFrame.boxplot(). Before we do this, we need to move the results from both arms of our experiment into a single DataFrame and name the columns.

results_accuracy = pd.concat([results_control_accuracy, results_experimental_accuracy], axis=1)
results_accuracy.columns = ['Control', 'Experimental']

If we print out this new variable, we can see all our results are now in a single DataFrame with appropriate column headings.
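
In a notebook cell, printing (or simply evaluating) the variable is enough:

print(results_accuracy)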

We can produce a box plot using a single line of code:

results_accuracy.boxplot()

The output of which will be the following if you've used the data provided above:

However, the scale of the plot has made it difficult to compare the two sets of data. We can see the problem with our own eyes: one of the samples from the control arm of the experiment, at around $0.60$, is a clear outlier.

With pandas we can try two approaches to remove this outlier from view and get a better look. One is not to plot the outliers using the showfliers parameter for the box plot method:

results_accuracy.boxplot(showfliers=False)

Which will output:

Or to instead specify the y-axis limits for the box plot:

ax = results_accuracy.boxplot()
ax.set_ylim([0.9,1])

Which will output:

Distribution of the data

It may also be useful to find out if our results are normally distributed, as this will help us decide whether to use parametric or non-parametric tests. You may be able to make this decision using a histogram:

results_accuracy.hist(density=True)

Which will output the following:

But normally you will want to use a function that will test the data for you. One approach is to use scipy.stats.normaltest():

This function tests the null hypothesis that a sample comes from a normal distribution.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html

This function will return two values: the test statistic and, most importantly for us, the p-value. The p-value, always between 0 and 1, is the probability of observing results at least as extreme as ours if the null hypothesis were true, and it indicates the strength of evidence against the null hypothesis. A smaller p-value indicates stronger evidence against the null hypothesis, whilst a larger p-value indicates weaker evidence against the null hypothesis.

For this test, the null hypothesis is that the samples come from a normal distribution. Before using the test, we need to decide on a value for alpha, our significance level. This is essentially the “risk” of concluding a difference exists when it doesn't, e.g. an alpha of $0.05$ indicates a 5% risk; we can consider alpha to be a kind of threshold. This will be covered in more detail in another article, but for now we will set $0.05$ as our alpha. This means that if our p-value is less than $0.05$, we reject the null hypothesis and the samples are likely not from a normal distribution. Otherwise, the null hypothesis cannot be rejected, and the samples are likely from a normal distribution.

Let's write some code to determine this for us:

from scipy import stats

alpha = 0.05

s, p = stats.normaltest(results_control_accuracy)
if p < alpha:
  print('Control data is not normal')
else:
  print('Control data is normal')

s, p = stats.normaltest(results_experimental_accuracy)
if p < alpha:
  print('Experimental data is not normal')
else:
  print('Experimental data is normal')

The output of which will be the following if you've used the data provided above:

Significance testing

Finally, let's test the significance of our pairwise comparison. The significance test you select depends on the nature of your data-set and other criteria, e.g. some select non-parametric tests if their data-sets are not normally distributed. We will use the Wilcoxon signed-rank test through the following function: scipy.stats.wilcoxon():

The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences x - y is symmetric about zero. It is a non-parametric version of the paired T-test.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html

This will give us some idea as to whether the results from the control arm are significantly different from those of the experimental arm. This will again return a p-value , and we will compare it with an alpha of $0.05$.

s, p = stats.wilcoxon(results_control_accuracy[0], results_experimental_accuracy[0])

if p < 0.05:
  print('null hypothesis rejected, significant difference between the data-sets')
else:
  print('failed to reject the null hypothesis, no significant difference between the data-sets')

The output of which will be the following if you've used the data provided above:

This means that although the mean accuracy of our experimental arm samples is higher than the mean accuracy of our control arm samples, the difference is not statistically significant and could have occurred purely by chance. We cannot say that one is better than the other.

Conclusion

In this article we had a look at how our solutions performed using some simple statistics and visualisations. We also tested whether our results came from a normal distribution, and whether the results from both arms of our experiment were significantly different from each other. Through significance testing we determined that we were not able to claim that one arm of the experiment outperformed the other, despite the mean performances being different. Regardless, a result is a result, and we could extend our experiment to cover a range of neurons per hidden layer instead of only comparing four against five.

Thank you for following this four-part series on Machine Learning with Kaggle Notebooks. If you notice any mistakes or wish to make any contributions, please let me know either using the comments or by e-mail.

Part 3 - Coded Implementation

Overview

This will be the third article in a four-part series covering the following:

  1. Dataset analysis - We will present and discuss a dataset selected for our machine learning experiment. This will include some analysis and visualisations to give us a better understanding of what we're dealing with.

  2. Experimental design - Before we conduct our experiment, we need to have a clear idea of what we're doing. It's important to know what we're looking for, how we're going to use our dataset, what algorithms we will be employing, and how we will determine whether the performance of our approach is successful.

  3. Implementation - We will use the Keras API on top of TensorFlow to implement our experiment. All code will be in Python, and at the time of publishing everything is guaranteed to work within a Kaggle Notebook.

  4. Results - Supported by figures and statistics, we will have a look at how our solution performed and discuss anything interesting about the results.

Implementation

In the last article we covered a number of experimental design issues and made some decisions for our experiments. We decided to compare the performance of two simple artificial neural networks on the Iris Flower dataset. The first neural network will be the control arm, and it will consist of a single hidden layer of four neurons. The second neural network will be the experimental arm, and it will consist of a single hidden layer of five neurons. We will train both of these using default configurations supplied by the Keras library and collect thirty accuracy samples per arm. We will then apply the Wilcoxon signed-rank test to test the significance of our results.

A simple training and testing strategy

With our dataset analysis and experimental design complete, let's jump straight into coding up the experiments.

If your desired dataset is hosted on Kaggle, as it is with the Iris Flower Dataset, you can spin up a Kaggle Notebook easily through the web interface:

Creating a Kaggle Notebook with the Iris dataset ready for use.

You're also welcome to use your own development environment, provided you can load the Iris Flower dataset.

Import packages

Before we can make use of the many libraries available for Python, we need to import them into our notebook. We're going to need numpy, pandas, tensorflow, keras, and sklearn. Depending on your development environment these may already be installed and ready for importing. You'll need to install them if that's not the case.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import tensorflow as tf # dataflow programming
from tensorflow import keras # neural networks API
from sklearn.model_selection import train_test_split # dataset splitting

If you're using a Kaggle Kernel notebook you can just update the default cell. Below you can see I've included imports for tensorflow, keras, and sklearn.

To support those using their own coding environment, I have listed the version numbers for the imported packages below:

  • tensorflow==1.11.0rc1
  • scikit-learn==0.19.1
  • pandas==0.23.4
  • numpy==1.15.2

Preparing the dataset

First, we load the Iris Flower dataset into a pandas DataFrame using the following code:

# Load iris dataset into dataframe
iris_data = pd.read_csv("/kaggle/input/Iris.csv")

Input parameters

Now, we need to separate the four input parameters from the classification labels. There are multiple ways to do this, but we're going to use pandas.DataFrame.iloc, which allows selection from the DataFrame using integer indexing.

# Splitting data into training and test set
X = iris_data.iloc[:,1:5].values

With the above code we have selected all the rows (indicated by the colon) and the columns at index 1, 2, 3, and 4 (indicated by the 1:5). You may be wondering why the column at index 5 was not included even though we specified 1:5; that's because Python slicing includes the start index but excludes the stop index. If we also wanted the column at index 5, we'd need to specify 1:6. It's also important to remember that Python's indexing starts at 0, not 1. If we had specified 0:5, we would also be selecting the "Id" column.

To remind ourselves of what columns are at index 1, 2, 3, and 4, let's use the pandas.DataFrame.head() method from the first part.

Samples from the Iris Flower dataset with the column indices labelled in red.

We can also print out the contents of our new variable, X, which is storing all the Sepal Length/Width and Petal Length/Width data for our 150 samples. This is all of our input data.

The input data selected from the Iris Flower dataset.

For now, that is all the processing needed for the input parameters.

Classification labels

We know from our dataset analysis in part 1 that our samples are classified into three categories, "Iris-setosa", "Iris-virginica", and "Iris-versicolor". However, this alphanumeric representation of the labels is not compatible with our machine learning functions, so we need to convert them into something numeric.

Again, there are many ways to achieve a similar result, but let's use pandas features for categorical data. By explicitly selecting the Species column from our dataset as being of the category datatype, we can use pandas.Series.cat.codes to get numeric values for our class labels.
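
In code, that step might look something like the following, storing the numeric codes in a variable y for use below:

# convert the Species strings to the category datatype and extract numeric codes
y = iris_data['Species'].astype('category').cat.codes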

We have one extra step, because we plan on using the categorical_crossentropy objective function to train our model. The Keras documentation gives the following instructions:

When using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).

Keras Documentation (https://keras.io/losses)

What this means is we will need to use One-hot encoding. This is quite typical for categorical data which is to be used with machine learning algorithms. Here is an example of One-hot encoding using the Iris Flower dataset:

One-hot encoding of the Iris Flower dataset class labels.

You can see that each classification label has its own column, so Setosa is \(1,0,0\), Virginica is \(0,1,0\), and Versicolor is \(0,0,1\).

Luckily encoding our labels using Python and Keras is easy, and we've already completed the first step which is converting our alphanumeric classes to numeric ones. To convert to One-hot encoding we can use keras.utils.to_categorical():

# Use One-hot encoding for class labels
Y = keras.utils.to_categorical(y,num_classes=None)

Training and testing split

In the previous part of this series we decided on the following:

The Iris Flower dataset is relatively small at exactly 150 samples. Because of this, we will use 70% of the dataset for training, and the remaining 30% for testing, otherwise our test set will be a little on the small side.

Machine Learning with Kaggle Notebooks - Part 2

This is where sklearn.model_selection.train_test_split() comes in. This function will split our dataset into a randomised training and testing subset:

# split into randomised training and testing subset
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3,random_state=0)

This code splits the data, giving 30% (45 samples) to the testing set and the remaining 70% (105 samples) for the training set. The 30/70 split is defined using test_size=0.3 and random_state=0 defines the seed for the randomisation of the subsets.

These have been spread across four new arrays storing the following data:

  • X_train : the input parameters, to be used for training.

  • y_train : the classification labels corresponding to the X_train above, to be used for training.

  • X_test : the input parameters, to be used for testing.

  • y_test : the classification labels corresponding to the X_test above, to be used for testing.

Before moving on, I recommend you have a closer look at the above four variables, so that you understand the division of the dataset.

Neural networks with Keras

Keras is the software library we will be using through Python to code up and conduct our experiments. It's a user-friendly, high-level neural networks library which in our case will be running on top of TensorFlow. What is most attractive about Keras is how quickly you can go from your design to a result.

Configuring the model

The keras.Sequential() model allows you to build a neural network by stacking layers. You can add layers using the add() method, which in our case will be Dense() layers. A dense (fully-connected) layer is one in which every neuron is connected to every neuron in the previous layer. Dense() expects a number of parameters, e.g. the number of neurons to be on the layer, the activation function, the input_shape (if it is the first layer in the model), etc.

model = keras.Sequential()

model.add(keras.layers.Dense(4, input_shape=(4,), activation='tanh'))
model.add(keras.layers.Dense(3, activation='softmax'))

In the above code we have created our empty model and then added two layers, the first is a hidden layer consisting of four neurons which are expecting four inputs. The second layer is the output layer consisting of our three output neurons.

We then need to configure our model for training, which is achieved using the compile() method. Here we will specify our optimiser to be Adam(), configure for categorical classification, and specify our use of accuracy for the metric.

model.compile(keras.optimizers.Adam(), 'categorical_crossentropy', metrics=['accuracy'])

At this point, you may wish to use the summary() method to confirm you've built the model as intended.
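
It is a single call on the model:

model.summary()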

Training the model

Now comes the actual training of the model! We're going to use the fit() method of the model and specify the training input data and desired labels, the number of epochs (the number of times the training algorithm sees the entire dataset), and a flag to set the verbosity of the process to silent. Setting the verbosity to silent is entirely optional, but it helps us manage the notebook output.

model.fit(X_train, y_train, epochs=300, verbose=0)

If you're interested in receiving more feedback during the training (or optimisation) process, you can remove the assignment of the verbose flag when invoking the fit() method to use the default value. Now when the training algorithm is being executed, you will see output at every epoch:

Testing the model

After the neural network has been trained, we want to evaluate it against our test set and output its accuracy. The evaluate() method returns a list containing the loss value at index 0 and in this case, the accuracy metric at index 1.

accuracy = model.evaluate(X_test, y_test)[1]

If we run all the code up until this point, and we output the contents of our accuracy variable, we should see something similar to the following:

Generating all our results

Up until this point, we have successfully prepared the Iris Flower dataset, configured our model, trained our model, evaluated it using the test set, and reported its accuracy. However, this reported accuracy is only one sample of our desired thirty.

We can generate the rest with a simple loop that repeats the process thirty times, and a list to store all the results. This only requires some minor modifications to our existing code:

results_control_accuracy = []
for i in range(0,30):
    model = keras.Sequential()
    model.add(keras.layers.Dense(4, input_shape=(4,), activation='tanh'))
    model.add(keras.layers.Dense(3, activation='softmax'))

    model.compile(keras.optimizers.Adam(lr=0.04),'categorical_crossentropy',metrics=['accuracy'])

    model.fit(X_train,y_train,epochs=100, verbose=0)

    accuracy = model.evaluate(X_test, y_test)[1]
    results_control_accuracy.append(accuracy)

print(results_control_accuracy)

This will take a few minutes to execute depending on whether you're using a Kaggle Kernel notebook or your own development environment, but once it has you should see a list containing the accuracy results for all thirty of the executions (but your results will vary):

[0.9333333359824286, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.6000000052981906, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9111111124356588, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9111111124356588]

These are the results for our control arm; let's now do the same for our experimental arm. The experimental arm only has one difference: the number of neurons on the hidden layer. We can re-use our code for the control arm and just make a single modification where:

model.add(keras.layers.Dense(4, input_shape=(4,), activation='tanh'))

is changed to:

model.add(keras.layers.Dense(5, input_shape=(4,), activation='tanh'))

Of course, we'll also need to change the name of the list variable so that we don't overwrite the results for our control arm. The code will end up looking like this:

results_experimental_accuracy = []
for i in range(0,30):
    model = keras.Sequential()
    model.add(keras.layers.Dense(5, input_shape=(4,), activation='tanh'))
    model.add(keras.layers.Dense(3, activation='softmax'))

    model.compile(keras.optimizers.Adam(lr=0.04),'categorical_crossentropy',metrics=['accuracy'])

    model.fit(X_train,y_train,epochs=100, verbose=0)

    accuracy = model.evaluate(X_test, y_test)[1]
    results_experimental_accuracy.append(accuracy)

print(results_experimental_accuracy)

After executing the above and waiting a few minutes, we will have our second set of results:

[0.9111111124356588, 0.9555555568801032, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.933333334657881, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.933333334657881, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9777777791023254, 0.9333333359824286, 0.9777777791023254, 0.9777777791023254, 0.9333333359824286, 0.9777777791023254, 0.9555555568801032, 0.9777777791023254, 0.9777777791023254]

Saving the results

The results for our experiment have been generated, and it's important that we save them somewhere so that we can use them later. There are multiple approaches to saving or persisting your data, but we are going to make use of pandas.DataFrame.to_csv():

pd.DataFrame(results_control_accuracy).to_csv('results_control_accuracy.csv', index=False)

pd.DataFrame(results_experimental_accuracy).to_csv('results_experimental_accuracy.csv', index=False)

The above code will save your results to individual files corresponding to each arm of the experiment. Where the files go depends entirely on your development environment. If you're developing in your own local environment, then you will likely find the files in the same folder as your notebook or script. If you're using a Kaggle Notebook, it is important that you click the blue commit button in the top right of the page.

It will take a few minutes to commit your notebook but once it's done, you know your file is safe. It's not immediately obvious where the files have been stored, but you can double check their existence by repeating the following steps:

Conclusion

In this article we prepared our dataset such that it was ready to be fed into our neural network training and testing process. We then built and trained our neural network models using Python and Keras, followed by some simple automation to generate thirty samples per arm of our experiment.

In the next part of this four-part series, we will have a look at how our solutions performed and discuss anything interesting about the results. This will include some visualisation, and we may even return to our experiment code to produce some new results.

Part 2 - Experimental Design

Overview

This will be the second article in a four-part series covering the following:

  1. Dataset analysis - We will present and discuss a dataset selected for our machine learning experiment. This will include some analysis and visualisations to give us a better understanding of what we're dealing with.
  2. Experimental design - Before we conduct our experiment, we need to have a clear idea of what we're doing. It's important to know what we're looking for, how we're going to use our dataset, what algorithms we will be employing, and how we will determine whether the performance of our approach is successful.
  3. Implementation - We will use the Keras API on top of TensorFlow to implement our experiment. All code will be in Python, and at the time of publishing everything is guaranteed to work within a Kaggle Notebook.
  4. Results - Supported by figures and statistics, we will have a look at how our solution performed, and discuss anything interesting about the results.

Experimental design

In the last article we introduced and analysed the Iris Flower Dataset within a Kaggle Kernel notebook. In this article we will design our experiments, and select some algorithms and performance measures to support our implementation and discussion of the results.

Visualisation from the last article: a parallel coordinate plot of our dataset grouped by Species.

Before moving onto the application of machine learning algorithms, we first need to design our experiment. Doing so will give us a clear idea of what we're looking for, how we intend to get there, and how we will decide the outcome. This process helps us when reporting our findings, whether it's to your stakeholders or a peer-reviewed paper.

This series is aimed at beginners, and we know the Iris Flower dataset is an easy one, so we will keep our idea simple. Let's design an experiment to find out whether the same machine learning algorithm will offer significantly different performance if we change just one part of the algorithm's configuration. The task we want the solution to complete is multi-class classification, i.e. classifying an Iris Flower dataset sample into one of the three species. In this context, each one of the species can also be referred to as a class.

Algorithm selection

Selecting the right algorithm or set of algorithms for an experiment is often a non-trivial task. However, for this easy problem there is little that is unknown, so we can select something simple without much consideration. For our purposes we will select an Artificial Neural Network learning algorithm, with one criterion: that it is easy to implement using Keras. Keras is the software library we will be using through Python, to code up and conduct our experiments - more on this in the next article.

We will cover Artificial Neural Networks in a different article, but for now let's treat them as a black box, which we will refer to as a neural network from now on. Put simply, they are networks of interconnected nodes arranged into layers. A neural network can be used for classification, regression, and clustering, which means it fits our requirement to classify our samples. When we pass in our four input parameters (sepal length and width, petal length and width), we expect an output of either Setosa, Versicolour, or Virginica.

Our black box neural network for classifying Iris Flower dataset samples into one of the three species.

Before our neural network can classify the inputs into what it thinks is the right output, it needs to be trained. Training usually involves adapting the strength of the connections between nodes within a neural network, often referred to as the weights. By continuously adapting these weights, which are used in calculations involving the input samples, a learning algorithm can try to learn the classification problem. The desirable output is a classifier that can map input samples to the desired/correct output.

Many variables affect the performance (e.g. in terms of accuracy) of a neural network classifier, including:

  • Topology/Structure:
      • Number of Hidden Layers
      • Number of Neurons per Hidden Layer
  • Trained Parameters:
      • Connection Weights
      • Neuron Biases

The above is not an exhaustive list, but let's visualise these variables.

An illustration of a feed-forward neural network classifier. Where $x_1 \ldots x_4$ are the input parameters for a single sample, $w_1 \ldots w_n$ are the weights, $b_1 \ldots b_n$ and $a_1 \ldots a_n$ are the biases and activation functions respectively, and $y_1 \ldots y_3$ are the outputs.

In the above figure we have peered into our black box, and we can see there are four inputs labelled \(x_1\) to \(x_4\) , 2 hidden layers each with 3 neurons, and three outputs labelled \(y_1\) to \(y_3\). To keep things simple, we will only use a neural network with a single hidden layer in our experiments.

Scientific control

There are no hard and fast rules for defining the structure of a neural network. However, a general piece of advice is that you should have at least the same number of neurons on a hidden layer as you do on the input layer. For our Iris Flower dataset, this means our hidden layer should consist of four neurons. We will aim to create a neural network with this structure and use the default configurations provided by Keras for everything else. This will be the control arm of our experiment. We will then train the neural network and measure its performance.

We will then run the same experiment again, but this time we will change a single configuration related to the structure of the neural network: the number of hidden layer neurons will go from four to five. This will be our experimental arm. This is a controlled experiment, and by only changing one variable we aim to eliminate any other explanations for why the two arms of our experiment perform differently. When analysing our results, we want to be able to say with some confidence: "changing the number of neurons on the hidden layer made a significant difference to the performance of our neural network", and then perhaps make observations about which classifier performs better.

Testable hypotheses

Let's define a testable hypothesis. Our hypothesis is a proposed explanation made on the basis of some limited or preliminary evidence, and this will be a starting point for further investigation. We should phrase it such that it is testable, and not ambiguous.

Hypothesis : A neural network classifier's performance on the Iris Flower dataset is affected by the number of hidden layer neurons.

When we test our hypothesis, there are two possible outcomes:

  • \(H_0\) the null hypothesis: insufficient evidence to support the hypothesis.

  • \(H_1\) the alternate hypothesis: evidence suggests the hypothesis is likely true.

If we fail to reject the null hypothesis, meaning there is not enough evidence to support our hypothesis, this suggests that there is no significant difference in performance when changing the number of hidden layer neurons. However, if we reject the null hypothesis and accept the alternate hypothesis, this suggests that there likely is a significant difference in performance when changing the number of hidden layer neurons.

Note

In order to sufficiently test our hypothesis, we would need to run multiple experiments with different numbers of neurons on the hidden layer. To keep it simple, we're only comparing an experiment with four neurons on the hidden layer against one with five.

Sample size sufficiency

Neural network training algorithms are stochastic, which means they employ random numbers. One step that often employs random numbers is the initialisation of the weights, so each training run can produce a classifier with different performance. This often catches beginners out, and must be taken into consideration when designing an experiment.

To demonstrate, I went ahead and set up a neural network within Keras with default configurations, and a single hidden layer consisting of 12 neurons. I executed the same code 10 individual times, and here are the accuracy results:

Accuracy
0.9777777777777777
0.9888888888888889
0.9555555489328172
0.9888888809416029
0.9666666666666667
1.0
0.9888888809416029
0.9888888888888889
0.9777777791023254
0.9666666666666667

As we can see from these preliminary results, the same algorithm with identical configuration has produced 10 different classifiers which all offer varying performance. By chance, some of the executions have produced classifiers which have reported the same performance as others, but these are likely to have entirely different weights. There is a way to get around this and always reproduce an identical neural network classifier, and that's by using the same seed for your random number generators. However, this doesn't tell us how good an algorithm is at training a neural network.
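
As a minimal sketch of what such seeding might look like with the TensorFlow 1.x versions used later in this series (tf.set_random_seed is the TensorFlow 1.x API):

import numpy as np
import tensorflow as tf

np.random.seed(42)      # fix numpy's random number generator
tf.set_random_seed(42)  # fix TensorFlow's graph-level seed (TensorFlow 1.x API)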

The real solution is to determine a sufficient sample size for your experiment. In this context, we're referring to each execution of your algorithm as a sample, and not the samples within the dataset as we were previously. There are many methods to determine a sufficient sample size, however, for the purpose of this series we will simply select a sample size of 30, which is a number often selected in the literature. This means we will be executing each algorithm 30 times, and then using some statistics to determine performance.

Significance testing

Now that we are comparing a population of 30 performance measures per arm of our experiment, instead of a single reading, we will need to rely on something like the mean of each arm to measure a difference or decide whether one arm outperforms the other. However, the mean alone is not enough evidence to reject the null hypothesis. Along with a sufficient sample size, we need to determine whether to run a parametric or non-parametric test. These tests should give us an indication of whether any difference between our two sets of results is "significant" or occurred "by chance".

Simple steps for a pairwise comparison of machine learning results

Parametric tests typically work well on sets that follow a normal distribution, and non-parametric tests typically work well on sets that don't. This is a topic we'll visit in a different article in the future, but for now we will select a non-parametric test called the Wilcoxon signed-rank test.

This test aligns with our hypothesis, as it is used to determine whether two related sets of samples come from the same distribution. Through this test, we can find out whether there is likely to be a significant difference between our two experiment arms.

Training and Testing Strategy

We need to decide on how we're going to use our data to train and then test our models. The easiest way is to divide the dataset samples into a training set and testing set, often in an 80-20 split. The training samples will be used by the learning algorithms to find the optimal weights. The testing set is used only to assess the performance of a fully-trained model, and must remain entirely unseen by the learning algorithm. The test set can be seen as a portion of the dataset which is kept in a "vault" until the very end. After assessing the model with the testing set, there can be no further tuning unless there are new and unseen samples which can be used for testing.

A simple training and testing strategy

The Iris Flower dataset is relatively small at exactly 150 samples. Because of this, we will use 70% of the dataset for training, and the remaining 30% for testing, otherwise our test set will be a little on the small side.

Performance Evaluation

Let's decide on how we will evaluate the performance of our experiments, and how we will interpret the results. First, let's define what we mean by true positives, false positives, true negatives, and false negatives, as these are used in the calculations for many of the popular metrics.

After a neural network has been trained using the training set, its predictive performance is tested using the test set. Through this exercise, we are able to see if the classifier was able to correctly classify each sample. In a binary classification problem where the output can only be true or false, we can use something called the confusion matrix:

A confusion matrix can be used to present the number of samples that were correctly or incorrectly classified.

True Positive (TP)
belong to the positive class and were correctly classified as the positive class.
False Positive (FP)
belong to the negative class and were incorrectly classified as the positive class.
True Negative (TN)
belong to the negative class and were correctly classified as the negative class.
False Negative (FN)
belong to the positive class and were incorrectly classified as the negative class.

However, in a multi-class classification problem the confusion matrix would look more like the following:

From these counts, we can calculate the following measures:

Accuracy
The percentage of samples correctly identified:

\[Accuracy = \frac{tp + tn}{tp + tn + fp + fn}\]

Precision (Positive predictive value)
Proportion of positives accurately identified:

\[Precision = \frac{tp}{tp + fp}\]

Recall (Sensitivity, True Positive Rate)
Proportion of positives correctly identified:

\[Recall = \frac{tp}{tp + fn}\]

True negative rate (Specificity)
Proportion of negatives correctly identified:

\[\text{True negative rate} = \frac{tn}{tn + fp}\]

F-measure
Balance between precision and recall:

\[\text{F-measure} = \frac{2 \times \text{sensitivity} \times \text{precision}}{\text{sensitivity} + \text{precision}}\]

These metrics all have their uses and are often used in combination to determine the performance of a classifier. The Iris Flower dataset is simple enough for us to use classification accuracy as our measurement for the pairwise comparison. This means for each arm of the experiment, we will complete 30 independent executions to build a set of classification accuracies, and then use the mean and significance testing to check our hypothesis. In addition, we will also report on the above metrics to see how they work.
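
For reference, scikit-learn provides ready-made implementations of these metrics; here is a minimal sketch using hypothetical true and predicted labels for a binary problem:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predicted labels

print(accuracy_score(y_true, y_pred))   # (tp + tn) / (tp + tn + fp + fn)
print(precision_score(y_true, y_pred))  # tp / (tp + fp)
print(recall_score(y_true, y_pred))     # tp / (tp + fn)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall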

Conclusion

In this article we've covered a breadth of experimental design issues. We have selected a simple neural network for our experiment, with the control and experimental arms differing only by one neuron on the hidden layer. We discussed the importance of using a sufficient sample size and conducting significance testing, and strategies for splitting the dataset for training and testing. Finally, we decided how we were going to measure our experiment and how we would report the results.

In the next part of this four part series, we will write all the code for the experiments using Python and the Keras API on top of TensorFlow, and collect our results in preparation for the final discussion.

Part 1 - Dataset Analysis

Overview

This will be the first article in a four part series covering the following:

  1. Dataset analysis - We will present and discuss a dataset selected for our machine learning experiment. This will include some analysis and visualisations to give us a better understanding of what we're dealing with.
  2. Experimental design - Before we conduct our experiment, we need to have a clear idea of what we're doing. It's important to know what we're looking for, how we're going to use our dataset, what algorithms we will be employing, and how we will determine whether the performance of our approach is successful.
  3. Implementation - We will use the Keras API on top of TensorFlow to implement our experiment. All code will be in Python, and at the time of publishing everything is guaranteed to work within a Kaggle Notebook.
  4. Results - Supported by figures and statistics, we will have a look at how our solution performed, and discuss anything interesting about the results.

Dataset analysis

In the last article we introduced Kaggle's primary offerings and proceeded to execute our first "Hello World" program within a Kaggle Notebook. In this article, we're going to move onto conducting our first machine learning experiment within a Kaggle Kernel notebook.

Hello World from the previous article.

To facilitate a gentle learning experience, we will try to rely on (relatively) simple/classic resources in regard to the selected dataset, tools, and algorithms. This article assumes you know what Kaggle is, and how to create a Kaggle Notebook.

Note

Although this article is focussed on the use of Kaggle Notebooks, this experiment can be reproduced in any environment with the required packages. Kaggle Notebooks are great because you can be up and running in a few minutes!

Put simply, if we want to conduct a machine learning experiment, we typically need something to learn from. Often, this will come in the form of a dataset, either collected from some real-world setting, or synthetically generated e.g. as a test dataset.

Iris Flower Dataset

To keep things manageable, we will rely on the famous tabular Iris Flower Dataset, created by Ronald Fisher. This is one of the most popular datasets in existence, and it has been used in many tutorials and examples found in the literature.

The Iris Flower.

You can find the dataset within the UCI Machine Learning Repository, and it's also hosted by Kaggle. The multivariate dataset contains 150 samples of the following four real-valued attributes:

  • sepal length,
  • sepal width,
  • petal length,
  • and petal width.

All dimensions are supplied in centimetres. Associated with every sample is also the known classification of the flower:

  • Setosa,
  • Versicolour,
  • or Virginica.

Helpful diagram presenting the 4 attributes and 3 classifications in the Iris dataset.

Typically, this dataset is used to produce a classifier which can determine the classification of the flower when supplied with a sample of the four attributes.

Notebook + Dataset = Ready

Let's have a closer look at the dataset using a Kaggle Notebook. If your desired dataset is hosted on Kaggle, as it is with the Iris Flower Dataset, you can spin up a Kaggle Notebook easily through the web interface:

Creating a Kaggle Kernel with the Iris dataset ready for use.

Once the notebook environment has finished loading, you will be presented with a cell containing some default code. This code does two things:

  1. Import Packages - You will see three import statements which load the os (various operating system tasks), numpy (linear algebra), and pandas (data processing) packages. Unless you plan on implementing everything from scratch (re-inventing the wheel), you will be making use of the many excellent packages that are available for free.
  2. Directory Listing - One of the first difficulties encountered by new users of Kaggle Notebooks and similar platforms is: "How do I access my dataset?" or "Where are my dataset files?". Kaggle appear to have pre-empted these questions with their default code which displays a directory listing of their input directory, ../input. When executing this default cell you will be presented with a list of files in that directory, so you know that using the path ../input/[filename] will point to what you want. Execute this code on our Iris Flower Kaggle Notebook and you will see Iris.csv amongst the files available. This file consisting of comma-separated values is what we will be using throughout our analysis, located at ../input/Iris.csv.

Executing the default code in a Kaggle Notebook.

Analysis with pandas

Now that we know where our dataset is located, let's load it into a DataFrame using pandas.read_csv. Then, to have a quick look at what the data looks like we can use the pandas.DataFrame.head() function to list the first 5 samples of our dataset.

iris_data = pd.read_csv("../input/Iris.csv")  # ../input is the same location as /kaggle/input
iris_data.head()

Using the head() function to list the first 5 samples of the Iris dataset.

The tabular output of the head() function shows us that the data has been loaded, and what's helpful is the inclusion of the column headings, which we can use to reference specific columns later on.

DataFrame.head(n=5). This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

Pandas documentation

Let's start interrogating our dataset. First, let's confirm we only have three different species classifications in our dataset - no more, no less. There are many ways to achieve this, but one easy way is to use pandas.DataFrame.nunique() on the Species column.

iris_data['Species'].nunique()

Confirming the number of distinct classes in our dataset.

We've now confirmed the number of classes. Another point of interest is the number of samples we have. Again, there are many ways to achieve this, e.g. pandas.DataFrame.shape, which returns the dimensionality of our dataset, and pandas.DataFrame.count(), which counts non-NA cells along the specified axis (columns by default).

iris_data.shape        # display the dimensionality
iris_data.count()      # count non-NA cells

Some different ways to count samples in a dataset.

Now we can be sure there are no NA cells, and that we have the 150 samples we were expecting. Let's move on to confirming the classification distribution of our samples, remembering that we are expecting a 50/50/50 split from the dataset description. One way to get this information is to use pandas.Series.value_counts(), which returns a count of unique values.

iris_data['Species'].value_counts()

Counting the number of samples for each classification.

As you can see, value_counts() has listed the number of instances per class, and it appears to be exactly what we were expecting.

Creating useful visualisations

Purely for the sake of starting some visualisation, let's create a bar-chart. There are many packages which enable visualisation with Python, and typically I rely on matplotlib. However, pandas has some easy-to-use functions which use matplotlib internally to create plots. Let's have a look at pandas.DataFrame.plot.bar(), which will create a bar-chart from the data we pass in.

iris_data['Species'].value_counts().plot.bar()

A simple bar-chart using pandas.DataFrame.plot.bar().

Nothing unexpected from this plot, as we know our dataset samples are classified into three distinct species in a 50/50/50 split, but now we can present this information visually.

Let's shift our attention to our four attributes; perhaps we're interested in the value range of each attribute, along with some summary statistics. It's always a good idea to have a quick look at the pandas documentation before attempting to code something up yourself, because it's often the case that pandas has a nice helper function for what you're after. In this case, we can use pandas.DataFrame.describe to generate some descriptive statistics for our dataset, and it actually includes some of the information we generated earlier in this article.

iris_data.describe()

Descriptive statistics for the Iris Flower Dataset.

What if we want to visualise our attributes to see if we can make any interesting observations? One approach is to use a parallel coordinate plot. Depending on the size of your dataset, it may not be feasible to use this type of plot as the output will be too cluttered and almost useless as a visualisation. However, with the Iris Flower Dataset being small, we can go ahead and use pandas.plotting.parallel_coordinates().

pd.plotting.parallel_coordinates(iris_data, "Species")

Parallel coordinate plot of our dataset grouped by Species.

The parallel coordinate plot has worked as intended, but in practice we can see there's an issue: our attributes are on very different scales, which makes the visualisation difficult to interpret. Of course, it would actually be a useful visualisation for illustrating that fact. In this case, we can see the Id attribute has been included in our plot, and we don't actually need or want this to be visualised - its values dwarf the four flower measurements. Removing it from the input to the plotting function may be all we need to do.

pd.plotting.parallel_coordinates(iris_data.drop("Id", axis=1), "Species")

Here we have used pandas.DataFrame.drop() to drop the Id column when we pass the Iris Flower Dataset to the parallel_coordinates function. This is done by specifying the label to drop ("Id") as the first argument, and axis=1 (to indicate columns) as the second.
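
As a side note, newer versions of pandas also accept a columns keyword on drop(), which some find a little clearer; the line below is an equivalent sketch rather than a change to our code:

pd.plotting.parallel_coordinates(iris_data.drop(columns="Id"), "Species")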

Parallel coordinate plot of our dataset (dropping Id's) grouped by Species.

That plot looks much better, and we can now see something interesting in the clustering of the attributes with regard to the species classifications. What stands out the most is the Setosa species, which looks to have a different pattern compared to Versicolor and Virginica. This was actually mentioned in the dataset description.

One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

UCI Machine Learning Repository: Iris Data Set

It's still interesting to have a closer look, and it's good practice for when working on different - perhaps undocumented - datasets. To do this we will use a scatter plot matrix, which will show us pairwise scatter plots of all of our attributes in a matrix format. Pandas has pandas.plotting.scatter_matrix(), which is intuitive until you want to colour markers by category (or species), so we will instead import a high-level visualisation library based on matplotlib called seaborn. From seaborn we can use seaborn.pairplot() to plot our pairwise relationships.

import seaborn as sns
sns.pairplot(iris_data.drop("Id",axis=1), hue="Species")

This is an incredibly useful visualisation which only required a single line of code, and we also have the kernel density estimates of the underlying features along the diagonal of the matrix. In this scatter plot matrix, we can confirm that the Setosa species is linearly separable, and if you wanted to you could draw a dividing line between this species and the other two. However, the Versicolor and Virginica classes are not linearly separable.
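
If you want to convince yourself of the separability claim programmatically, here is a quick sketch (not part of our main analysis); the choice of a Perceptron is an assumption - any linear model would do - and the species labels are spelled as they appear in the Kaggle Iris.csv:

# A quick, illustrative check: a linear model on setosa-vs-rest should reach
# (near-)perfect training accuracy, while versicolor-vs-virginica should not.
from sklearn.linear_model import Perceptron

X = iris_data.drop(["Id", "Species"], axis=1)
y = iris_data["Species"]

# Setosa vs the other two classes
is_setosa = (y == "Iris-setosa")
print(Perceptron(random_state=0).fit(X, is_setosa).score(X, is_setosa))

# Versicolor vs Virginica only
rest = iris_data[~is_setosa]
X_rest = rest.drop(["Id", "Species"], axis=1)
y_rest = rest["Species"]
print(Perceptron(random_state=0).fit(X_rest, y_rest).score(X_rest, y_rest))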

One final visualisation I want to share with you is the Andrews curves plot for visualising clusters of multivariate data. We can plot this using pandas.plotting.andrews_curves().

pd.plotting.andrews_curves(iris_data.drop("Id", axis=1), "Species")

Conclusion

In this article we've introduced and analysed the Iris Flower Dataset. For our analysis we used various helpful functions from the pandas package, and then created some interesting visualisations using pandas and seaborn on top of matplotlib. All code was written and executed within a Kaggle Notebook - here it is if you are interested - but for now all the corresponding narrative is in this article.

Now that we have a good idea about the dataset, we can move on to our experimental design. In the next part of this four-part series, we will design our experiments and select some algorithms and performance measures to support our implementation and discussion of the results.

Normal Distribution Test

SciPy stats.normaltest

In [7]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
In [8]:
heights_male = np.array([100, 256, 238, 116, 286, 253, 112, 165, 246, 130, 217, 269, 155,
       136, 189, 235, 255, 113, 280, 222, 259, 177, 294, 290, 225, 113,
       163, 137, 172, 127])

heights_female = np.array([126, 172, 137, 163, 113, 225, 290, 294, 175, 259, 220, 280, 111,
       255, 235, 189, 136, 150, 269, 214, 130, 243, 165, 110, 253, 286,
       116, 238, 255, 99])

print("mean heights (male): {}".format(np.mean(heights_male)))
print("mean heights (female): {}".format(np.mean(heights_female)))
mean heights (male): 197.66666666666666
mean heights (female): 196.93333333333334
In [11]:
s, p = stats.wilcoxon(heights_female, heights_male)

if p < 0.05:
  print("null hypothesis rejected, significant difference between the data-sets")
else:
  print("null hypothesis accepted, no significant difference between the data-sets")

print("p value = {}".format(p))
null hypothesis accepted, no significant difference between the data-sets
p value = 0.9425801920860144
In [13]:
plt.hist(heights_male, color="magenta", density=True)  # density=True replaces the deprecated normed argument
plt.xlim(100,300)
plt.xlabel('Height');
plt.show()
In [14]:
plt.hist(heights_female, color="yellow", density=True);
plt.xlim(100,300)
plt.xlabel('Height');
plt.show()
In [15]:
SEM = []

for sample_size in range(3,len(heights_male)+1):
    sample = heights_male[0:sample_size]
    SEM.append(sample.std() / np.sqrt(sample_size))

plt.plot(range(3,len(heights_male)+1),SEM, marker='o', color='cyan')

plt.ylabel("Standard Error of the Mean ($SE_M$)")    
plt.xlabel("Sample size $(n)$")
plt.title("Relationship between $SE_M$ and $n$");
In [18]:
print(stats.normaltest(heights_male))
print(stats.normaltest(heights_female))
NormaltestResult(statistic=13.548310785013712, pvalue=0.0011429354242245898)
NormaltestResult(statistic=13.278600632632264, pvalue=0.001307942069480237)

Pairwise Comparison Example

Pairwise Comparison

Pairwise comparison of data-sets is very important. It allows us to compare two sets of data and decide whether:

  • one is better than the other,
  • one has more of some feature than the other,
  • the two sets are significantly different or not.

In the context of the weather data that you've been working with, we could test the following hypotheses:

  • Hypothesis 1: The mean temperature in China is greater than the mean temperature in Japan
  • Hypothesis 2: The mean humidity in Russia is greater than the mean humidity in Spain

There are many tests for rejecting the null hypothesis (that there is no difference between the two data-sets); which one you use depends on the properties of your data-set. Below is a brief list of some of the available tests:

Parametric          | Non-parametric
Paired t-test       | Wilcoxon signed-rank test
Unpaired t-test     | Mann-Whitney U test
Pearson correlation | Spearman correlation
...                 | ...

You can see they've been divided into two groups: parametric tests and non-parametric tests. Which group you use again depends on the properties of your data-set; in this case, it's whether your data-sets follow a specific distribution. Typically, if your data-set doesn't follow a specific distribution, you want to use a non-parametric test, as these tests don't make any assumptions about the distribution of your data.
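
As a rough recipe (a sketch rather than a hard rule), this decision can be scripted: test for normality first, then reach for the parametric or non-parametric alternative accordingly. The synthetic paired samples below are placeholders, not our weather data:

# Sketch: choose between a paired t-test (parametric) and the Wilcoxon
# signed-rank test (non-parametric) based on a normality check.
import numpy as np
from scipy import stats

np.random.seed(0)
a = np.random.normal(20.0, 5.0, size=100)   # placeholder paired measurements
b = a + np.random.normal(1.0, 5.0, size=100)

alpha = 0.05
_, p_a = stats.normaltest(a)
_, p_b = stats.normaltest(b)

if p_a < alpha or p_b < alpha:
    # at least one sample departs from normality, so use a non-parametric test
    stat, p = stats.wilcoxon(a, b)
    test_name = "Wilcoxon signed-rank test"
else:
    stat, p = stats.ttest_rel(a, b)
    test_name = "paired t-test"

print("{}: p = {}".format(test_name, p))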

Before we move on to running any of these tests, we need to determine a sufficient sample size. This is because we cannot make a reliable judgement if, for example, we only compare 4 temperature readings from Japan with 4 from China. The mean temperature of both countries may be the same - but was there enough data to make this claim?

Sample Size Sufficiency

Let's get straight into it! We will be using numpy, matplotlib.pyplot, and pandas, so make sure to import these in the following cell. I've imported numpy for you. If you have access to it, I recommend also importing seaborn, which automatically improves the quality of your plots. You may not have this in your labs, so feel free to comment it out.

In [1]:
%matplotlib inline 

# imports here
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt

Now we want to load in our data-set from the CSV file. You can use the pd.read_csv() function. You will also need to add the following column names:

wt.columns = ['time', 'country', 'humidity', 'pressure', 'temp', 'max_temp', 'min_temp']

In [2]:
# load data into variable 'wt' here
wt = pd.read_csv("weather.csv", names=['time', 'country', 'humidity', 'pressure', 'temp', 'max_temp', 'min_temp'])

# output a sample of the data for inspection...
wt.head()
Out[2]:
time country humidity pressure temp max_temp min_temp
0 15/03/14 NP 33.0 1020.00 297.15 297.15 297.15
1 15/03/14 VE 99.0 835.59 288.64 288.64 288.64
2 15/03/14 CN 12.0 613.65 280.79 280.79 280.79
3 15/03/14 TR 87.0 1021.00 283.42 284.15 283.15
4 15/03/14 DO 100.0 1020.65 295.14 295.14 295.14

Let's check the sample size sufficiency for Hypothesis 1. For this we will need to pick out the temperatures for Japan and China, storing them in variables named temp_JP and temp_CN respectively.

In [3]:
# variables holding the temperatures for Japan and China
temp_JP = wt[wt.country=='JP'].temp
temp_CN = wt[wt.country=='CN'].temp

So that we have an idea of the number of samples in each data-set, let's print them out for inspection. There are many ways to do this, using .count(), .size, .shape, etc.

In [4]:
# print out the number of samples for inspection
print "number of temperature samples for Japan: {}".format(temp_JP.count())
print "number of temperature samples for China: {}".format(temp_CN.count())
number of temperature samples for Japan: 708
number of temperature samples for China: 282

Next, let's store the mean of temp_JP and temp_CN in variables named temp_JP_mean and temp_CN_mean respectively.

In [5]:
# variables holding the mean temperature for Japan and China
temp_JP_mean = temp_JP.mean() 
temp_CN_mean = temp_CN.mean() 

# print out the means for inspection
print "mean temperature in Japan: {}".format(temp_JP_mean)
print "mean temperature in China: {}".format(temp_CN_mean)
mean temperature in Japan: 281.628714689
mean temperature in China: 288.607340426

It may be tempting at this point to state that the temperature in China is typically warmer than the temperature in Japan. But this could simply be by chance! We must be confident that we can make this claim with some statistical significance.

Now that we have the data-sets and their means, we can have a look at the relationship between sample sizes and the Standard Error of the Mean (SEM). The formula for this is:

$$SE_{M} = \frac{s}{\sqrt{n}}$$

We will implement this for all our possible sample sizes for Japan, and then China below.

In [6]:
SEM = []

for sample_size in range(3,len(temp_JP)):
    sample = temp_JP[0:sample_size]
    SEM.append(sample.std() / np.sqrt(sample_size))
    
plt.plot(SEM, color='magenta')
plt.plot(200,SEM[200], marker='o', color='cyan')

plt.ylabel("Standard Error of the Mean ($SE_M$)")    
plt.xlabel("Sample size $(n)$")
plt.title("Relationship between $SE_M$ and $n$");
In [7]:
SEM = []

for sample_size in range(3,len(temp_CN)):
    sample = temp_CN[0:sample_size]
    SEM.append(sample.std() / np.sqrt(sample_size))

plt.plot(SEM, color='magenta')
plt.plot(200,SEM[200], marker='o', color='cyan')

plt.ylabel("Standard Error of the Mean ($SE_M$)")    
plt.xlabel("Sample size $(n)$")
plt.title("Relationship between $SE_M$ and $n$");

In the two figures above we can see that a sample size of 200 is sufficient for both Japan and China. This decision has been made visually, based on where the SEM curve flattens out. There are many ways of doing this, but let's move on to using these samples to test our hypothesis!
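
If you would rather pick the cut-off programmatically than by eye, one possible heuristic (an illustrative sketch only - the threshold value is an arbitrary assumption) is to take the smallest sample size whose SEM falls below a chosen tolerance:

# One possible heuristic: the smallest n at which the SEM drops below a threshold.
def sufficient_sample_size(values, threshold):
    values = np.asarray(values)
    for n in range(3, len(values) + 1):
        sem = values[:n].std() / np.sqrt(n)
        if sem < threshold:
            return n
    return len(values)  # the SEM never dropped below the threshold

print(sufficient_sample_size(temp_JP, 0.5))  # 0.5 is an arbitrary tolerance
print(sufficient_sample_size(temp_CN, 0.5))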

First, let's import scipy.

In [8]:
# imports here
from scipy import stats

Now let's check to see if our data-sets are normally distributed, so we know whether to use parametric or non-parametric tests to compare them. For this we will use the typical alpha, $\alpha = 0.05$, as the cut-off for significance.

Let's start with Japan.

In [9]:
s, p = stats.normaltest(temp_JP)  # p = p-value
print(p)
if p < 0.05:
  print('JP temperature data is not normal')
else:
  print('JP temperature data is normal')
7.92110556459e-11
JP temperature data is not normal

Now let's check China.

In [10]:
s, p = stats.normaltest(temp_CN)  # p = p-value
if p < 0.05:
  print('CN temperature data is not normal')
else:
  print('CN temperature data is normal')
CN temperature data is not normal

Let's plot the normalised histograms for the temperatures in Japan and China to look at this visually.

In [11]:
# plot histograms for temp_JP and temp_CN, make sure to set density to True!
temp_JP.hist(color='magenta', density=True);
temp_CN.hist(color='cyan', alpha=0.5, density=True);

Our data does not appear to follow a normal distribution, so we may decide to use a non-parametric test. There is much to consider when selecting a test, but we will use the Wilcoxon signed-rank test.

Using scipy.stats.wilcoxon, conduct a pairwise comparison of temperatures from Japan to temperatures from China to see if the null hypothesis is rejected.

In [12]:
s, p = stats.wilcoxon(temp_JP[0:200], temp_CN[0:200])

if p < 0.05:
  print('null hypothesis rejected, significant difference between the data-sets')
else:
  print('null hypothesis accepted, no significant difference between the data-sets')
null hypothesis rejected, significant difference between the data-sets

Now, let's use a quick if statement to resolve whether the mean temperature in our China sample is greater than in our Japan sample. We can make this claim now that we have checked the significance of our results.

In [13]:
if temp_CN[0:200].mean() > temp_JP[0:200].mean():
    print("Hypothesis 1 accepted, it is warmer in China")
else:
    print("Hypothesis 1 rejected, it is warmer in Japan")
Hypothesis 1 accepted, it is warmer in China

Extras

  1. Copy this notebook and try testing Hypothesis 2; make sure your decision to use parametric or non-parametric testing is justified.
  2. Check out the real-time plotting below.
In [14]:
from IPython.display import display, clear_output

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1) 

for i in range(200):
    x = range(0,i);
    y = SEM[0:i]
    
    ax.cla()
    ax.plot(x, y)
    plt.ylabel("Standard Error of the Mean ($SE_M$)")    
    plt.xlabel("Sample size $(n)$")
    plt.title("Relationship between $SE_M$ and $n$");
    display(fig)
    
    clear_output(wait = True)
  
    plt.pause(0.1)

Part 0 - Getting Started with Kaggle

What is Kaggle?

Kaggle is a website located at http://kaggle.com that markets itself as the place to do data science projects.

It has five primary offerings:

  1. Competitions - Kaggle hosts many different types of competitions, with a range of problems requiring different levels of skill to solve. These range from featured competitions, which are full-scale challenges and often offer prizes as high as a million dollars, all the way to getting started competitions, which are relatively easy and serve as a good entry into the competition format.
  2. Datasets - Many people end up on the Kaggle website when using search engines in pursuit of datasets. There is a wealth of datasets from a range of domains which can support a variety of interesting projects. What strengthens this offering is that the datasets include good documentation, a built-in discussion system which can act as a useful knowledge-base, and easily accessible code from community contributors which do interesting things with the data itself.
  3. Kernels - To conduct your data science experiments you'll often need an environment with an installation of your programming language of choice, complete with some kind of scientific stack. With Kaggle Kernels, you can get started on the actual data science without having to worry about maintaining your own local installation - it operates entirely through the web browser! This has the benefit of being a cloud environment which facilitates reproducible code and collaboration. Of course, once you have a good understanding of your needs, it will always be better to invest in your own hardware and local environment.
  4. Discussion - The website is backed by a strong community, and this is evident in the discussion areas offered by the website. You can visit one of the many forum areas, e.g. Questions & Answers which is for requesting technical advice from other data scientists, to a feedback area where you can request new features or discuss existing ones.
  5. Learn - You may want to pick up a new skill related to data science, or simply re-cap on an existing one. Kaggle offers free and short courses in data science, from learning SQL, to gaining some Data Visualisation knowledge to get useful representations of your data and results.

Getting started

Let's go through the process of signing up to Kaggle and firing up a Kernel to execute a Hello World program in Python.

The http://kaggle.com user sign-up form.

Direct your browser to http://kaggle.com and register your account using one of the sign-up approaches, if you agree to their terms. You will need to activate your account via a verification e-mail which should arrive immediately.

In the top navigation, click Notebooks. You will be presented with a list of public Notebooks submitted and maintained by the community - this may be an interesting source of knowledge for you, to see how other data scientists do things.

In the top right, click the New Notebook button, which will allow you to pick between two types of development environment: Script and Notebook. These behave differently in the way the code is executed and how variables are handled during runtime. Let's pick Notebook for the time being, but you may wish to explore Script too.

Starting a new Kaggle Notebook and choosing between the Script or Notebook type.

Once you have selected your Kernel type, you will be taken to a new notebook loaded into the Kaggle Notebook environment. Here you can enter your code directly, whether it is in Python or R which can be toggled using a dropdown in the interface. On the right-hand side you will see various widgets which include Session Information, Versioning, Information, Data Sources, Settings, Documentation, and a link to the API.

Notebook loaded into the Kaggle Notebook environment.

To implement and run our Hello World program, let's first remove the code in the existing cell. You can click inside the first cell, select all the code, and hit delete/backspace on your keyboard. Now you're ready to write your program. Type in the following code:

print("Hello World")

and click the Play button to the left of the cell to execute it. You should see the output appear below the cell, "Hello World"!

A simple Hello World program written in Python within a Kaggle Notebook.

That's all there is to getting your Hello World program running within a Kaggle Notebook. If you need to brush up on your Python skills, you can also check out one of the free courses mentioned earlier in this article: https://www.kaggle.com/learn/python.

Adding a new cell

To add a new code cell, we need to click on one of the blue "+" buttons. The one with an arrow up adds a cell above the current cell, and the one with an arrow down adds a cell below the current cell:

Next…

In the next article we'll have a look at using a Kaggle Notebook for some machine learning tasks.

Python Cheatsheet

Who is this for?

This crash-course makes the assumption that you already have some programming experience, but perhaps none with Python.

Comments

In [1]:
# You can write a comment using an octothorpe or 'hash' symbol

Printing to terminal

In [2]:
print("Hello World!")
Hello World!

Variables and assignment

In [3]:
my_number = 5
my_float = 0.2
my_string = "Hello"
also_my_string = 'Hello'
In [4]:
print("{} is of {}".format(my_number, type(my_number)))
print("{} is of {}".format(my_float, type(my_float)))
print("{} is of {}".format(my_string, type(my_string)))
print("{} is of {}".format(also_my_string, type(also_my_string)))
5 is of <class 'int'>
0.2 is of <class 'float'>
Hello is of <class 'str'>
Hello is of <class 'str'>

Terminal input

In [5]:
my_result = input('please enter a word: ')
print("result: {}".format(my_result))
please enter a word: Hello
result: Hello

Mathematical operations

In [6]:
my_result = 5 + 2 # add
print("result: {}".format(my_result))
result: 7
In [7]:
my_result = 5 - 2 # subtract
print("result: {}".format(my_result))
result: 3
In [8]:
my_result = 5 * 2 # multiply
print("result: {}".format(my_result))
result: 10
In [9]:
my_result = 5 / 2 # divide
print("result: {}".format(my_result))
result: 2.5
In [10]:
my_result = 5 % 2 # modulo
print("result: {}".format(my_result))
result: 1
In [11]:
my_result = 5 ** 5 # exponent
print("result: {}".format(my_result))
result: 3125

Boolean operations

In [12]:
my_result = True
print("result: {}".format(my_result))

my_result = False
print("result: {}".format(my_result))
result: True
result: False
In [13]:
# and
my_result = False and False
print("result: {}".format(my_result))

my_result = False and True
print("result: {}".format(my_result))

my_result = True and False
print("result: {}".format(my_result))

my_result = True and True
print("result: {}".format(my_result))
result: False
result: False
result: False
result: True
In [14]:
# or
my_result = False or False
print("result: {}".format(my_result))

my_result = False or True
print("result: {}".format(my_result))

my_result = True or False
print("result: {}".format(my_result))

my_result = True or True
print("result: {}".format(my_result))
result: False
result: True
result: True
result: True
In [15]:
# not
my_result = not True
print("result: {}".format(my_result))

my_result = not False
print("result: {}".format(my_result))
result: False
result: True

Relational operations

In [16]:
my_result = 5 > 2 # greater than
print("result: {}".format(my_result))

my_result = 5 >= 2 # greater than or equal to
print("result: {}".format(my_result))

my_result = 5 < 2 # less than
print("result: {}".format(my_result))

my_result = 5 <= 2 # less than or equal to
print("result: {}".format(my_result))

my_result = 5 == 2 # equal to
print("result: {}".format(my_result))
result: True
result: True
result: False
result: False
result: False

Conditional statements

In [17]:
age = 17
uk_drinking_age = 18

if(age >= uk_drinking_age): # IF
    print("you can drinking!")
elif(age == 17):            # ELSE IF
    print("you can drink on your birthday!")
else:                       # ELSE
    print("you can't drink!")
you can drink on your birthday!

Data structures

In [18]:
# LIST
shopping_list = ["Carrots", "Onions", "Chicken", "Coconuts"]
print(shopping_list)
print(shopping_list[0]) # Python is zero-indexed

shopping_list.append("Ice cream")
print(shopping_list)
['Carrots', 'Onions', 'Chicken', 'Coconuts']
Carrots
['Carrots', 'Onions', 'Chicken', 'Coconuts', 'Ice cream']
In [19]:
# DICTIONARY
shopping_list = {"Carrots": 5, "Onions": 2, "Chicken": 1, "Coconuts": 1}
print(shopping_list)
print("Careful - dictionaries are not ordered in older versions of Python.")
print(shopping_list['Carrots'])
{'Carrots': 5, 'Onions': 2, 'Chicken': 1, 'Coconuts': 1}
Careful - dictionaries are not ordered in older versions of Python.
5
In [20]:
# TUPLE
my_result = (5, 6)
print(my_result)

my_result = (5, "hello", 5.5, True, 2)
print(my_result)
(5, 6)
(5, 'hello', 5.5, True, 2)

Loops

In [21]:
# FOR

shopping_list = ["Carrots", "Onions", "Chicken", "Coconuts"]

for item in shopping_list:
    print(item)
Carrots
Onions
Chicken
Coconuts
In [22]:
# FOR

for index in range(0,10):
    print(index)
0
1
2
3
4
5
6
7
8
9
In [23]:
# WHILE
number = 0

while(number < 18):
    print(number)
    number += 1
    
print("number was no longer below 18!")
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
number was no longer below 18!

Functions

In [24]:
def my_function(): # create a function
    print("No parameters passed in")
    
my_function()      # call a function
No parameters passed in
In [25]:
def another_function(name):
    print("Parameter passed in: {}".format(name))
    
another_function("Derek")
Parameter passed in: Derek
In [26]:
def greater_than(left_operand, right_operand): 
    my_result = left_operand > right_operand
    return my_result      # returning a value from the function

print(greater_than(5, 2)) # printing the value returned from the function
True

Packages

In [27]:
import datetime

print(datetime.datetime.now())
2019-07-12 19:37:47.117505

Standard Deviation

This is a brief re-cap on calculating the standard deviation.

Let's assume that we have measured the height of each member in our population below:

Figure 1 - The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Images taken from http://www.mathsisfun.com/data/standard-deviation.html

We'll put these in a numpy array so we can use them later on:

In [1]:
import numpy as np
heights = np.array([600, 470, 170, 430, 300])

Numpy has many useful built-in functions. One of these calculates the standard deviation for any set of data, below you can see this in action:

In [2]:
print("standard deviation is {}".format(np.std(heights)))
standard deviation is 147.32277488562318

Calculating the standard deviation ourselves

As always, it's recommended you use these well-maintained functions instead of writing your own. However, with this population size being so small, we can get a better understanding of how the standard deviation is calculated if we do it ourselves.

Let's have a look at the formula: $$\sigma = \sqrt{\frac{\Sigma{(x - \mu)^2}}{N}}$$

To those of you who aren't comfortable with mathematics, this can be overwhelming. But let's break it down and calculate everything step-by-step:

First we want to find $\mu$, which is the mean of the population values: all the values summed together and then divided by the population size. For our population, this would be:

In [3]:
population_mean = np.sum(heights) / heights.size
print("population mean is: {}".format(population_mean))
population mean is: 394.0

We can find the mean using a built-in numpy function too:

In [4]:
population_mean = np.mean(heights)
print("population mean is: {}".format(population_mean))
population mean is: 394.0

Great - now we have the mean, let's plot it on our graph in green:

Figure 2 - The population mean (green) plotted on our graph.

Now we want to calculate $(x - \mu)$, which is the difference between each height and the mean:

Figure 3 - Difference between each height and the mean.

In Python + numpy, we can calculate this:

In [5]:
height_differences = heights - population_mean
print("height differences: {}".format(height_differences))
height differences: [ 206.   76. -224.   36.  -94.]

Now to get the variance which is this part: $$\frac{\Sigma{(x - \mu)^2}}{N}$$

To do this, we first square all of the height differences above and then take their average.

In [6]:
variance = (height_differences**2) 
variance = np.sum(variance)
variance = variance / height_differences.size
print("variance: {}".format(variance))
variance: 21704.0

Again, we can find the variance using a built-in numpy function too:

In [7]:
variance = np.var(height_differences)
print("variance: {}".format(variance))
variance: 21704.0

Finally, the last part is to find the square root of our variance, which is the standard deviation:

$$\sigma = \sqrt{\frac{\Sigma{(x - \mu)^2}}{N}}$$
In [8]:
standard_deviation = np.sqrt(variance)
print("standard deviation is {}".format(standard_deviation))
standard deviation is 147.32277488562318

When we plot this standard deviation on our graph we get the following:

Figure 4 - Standard deviation (purple) plotted on our graph.

With the standard deviation, we can identify which heights fall within one standard deviation (147mm) of the mean. This gives us a way of deciding which heights are normal, and which are extra large or extra small.
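
Continuing with the same array, a quick way of asking numpy which heights fall within one standard deviation of the mean (a small illustrative addition to the notebook above):

# Heights within one standard deviation of the mean, and those outside it.
within_one_sd = np.abs(heights - population_mean) <= standard_deviation
print(heights[within_one_sd])   # the "normal" heights by this rule
print(heights[~within_one_sd])  # the extra large / extra small heights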

Population vs Sample

If the data we're working with is a sample taken from a larger population, then there's a slight difference in how the standard deviation is calculated.

Instead of: $$\sigma = \sqrt{\frac{\Sigma{(x - \mu)^2}}{N}}$$

We use: $$s = \sqrt{\frac{\Sigma{(x - \bar{x})^2}}{n - 1}}$$

where $\bar{x}$ is the sample mean and $n$ is the sample size.
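
In numpy the only change needed is the ddof (delta degrees of freedom) argument, which switches the divisor from $N$ to $n - 1$; a short sketch using the same heights array:

# Population standard deviation (divide by N) versus sample standard deviation (divide by n - 1).
print("population standard deviation: {}".format(np.std(heights)))          # ddof=0, the default
print("sample standard deviation:     {}".format(np.std(heights, ddof=1)))  # ddof=1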