Oversampling the Cardiotocography Data Set

Preamble

You may be missing the imblearn package used below. With Anaconda, you can install it using conda install -c conda-forge imbalanced-learn

In [56]:
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to set shahin plot layout

from imblearn.over_sampling import ADASYN

pio.templates['shahin'] = pio.to_templated(
    go.Figure().update_layout(
        legend=dict(orientation='h', y=1.1, x=0.5, xanchor='center'),
        margin=dict(t=0, r=0, b=40, l=40)
    )
).layout.template
pio.templates.default = 'shahin'

Introduction

We've covered class imbalance and oversampling in another section, so this will serve as another example on a different dataset. In this section, we'll be using the Cardiotocography (CTG) dataset located at https://archive.ics.uci.edu/ml/datasets/cardiotocography. It has 23 attributes, two of which are alternative classifications of the same samples: CLASS (1 to 10) and NSP (1 to 3).

Downloading the Dataset

To keep this notebook independent, we will download the CTG dataset within our code. If you've already downloaded it and viewed it in some spreadsheet software, you will have noticed that we want to use the "Data" sheet, and that we also want to drop the first row.

In [57]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls"
data=pd.read_excel(url, sheet_name="Data", header=1)

Let's have a quick look at what we've downloaded.

In [58]:
data.head()
Out[58]:
b e AC FM UC DL DS DP DR Unnamed: 9 ... E AD DE LD FS SUSP Unnamed: 42 CLASS Unnamed: 44 NSP
0 240.0 357.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN ... -1.0 -1.0 -1.0 -1.0 1.0 -1.0 NaN 9.0 NaN 2.0
1 5.0 632.0 4.0 0.0 4.0 2.0 0.0 0.0 0.0 NaN ... -1.0 1.0 -1.0 -1.0 -1.0 -1.0 NaN 6.0 NaN 1.0
2 177.0 779.0 2.0 0.0 5.0 2.0 0.0 0.0 0.0 NaN ... -1.0 1.0 -1.0 -1.0 -1.0 -1.0 NaN 6.0 NaN 1.0
3 411.0 1192.0 2.0 0.0 6.0 2.0 0.0 0.0 0.0 NaN ... -1.0 1.0 -1.0 -1.0 -1.0 -1.0 NaN 6.0 NaN 1.0
4 533.0 1147.0 4.0 0.0 5.0 0.0 0.0 0.0 0.0 NaN ... -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 2.0 NaN 1.0

5 rows × 46 columns

Preparing the Dataset

Looking at the data, we can see that the last three rows are not samples.

In [59]:
data.tail()
Out[59]:
b e AC FM UC DL DS DP DR Unnamed: 9 ... E AD DE LD FS SUSP Unnamed: 42 CLASS Unnamed: 44 NSP
2124 1576.0 3049.0 1.0 0.0 9.0 0.0 0.0 0.0 0.0 NaN ... 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 5.0 NaN 2.0
2125 2796.0 3415.0 1.0 1.0 5.0 0.0 0.0 0.0 0.0 NaN ... -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 1.0 NaN 1.0
2126 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2127 NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 NaN ... 72.0 332.0 252.0 107.0 69.0 197.0 NaN NaN NaN NaN
2128 NaN NaN NaN 564.0 23.0 16.0 1.0 4.0 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 46 columns

So we will get rid of them.

In [60]:
data = data.drop(data.tail(3).index)
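An equivalent approach that doesn't assume exactly three trailing rows is to drop every row missing an NSP label. A minimal sketch on a toy frame (the column values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the CTG layout: the trailing summary rows carry no NSP label.
toy = pd.DataFrame({
    'LB': [120.0, 132.0, np.nan, np.nan],
    'NSP': [2.0, 1.0, np.nan, np.nan],
})

# Keep only rows that actually carry a classification.
cleaned = toy.dropna(subset=['NSP'])
```

This is more robust if the upstream spreadsheet ever gains or loses a summary row.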

Let's check to make sure they're gone.

In [61]:
data.tail()
Out[61]:
b e AC FM UC DL DS DP DR Unnamed: 9 ... E AD DE LD FS SUSP Unnamed: 42 CLASS Unnamed: 44 NSP
2121 2059.0 2867.0 0.0 0.0 6.0 0.0 0.0 0.0 0.0 NaN ... 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 5.0 NaN 2.0
2122 1576.0 2867.0 1.0 0.0 9.0 0.0 0.0 0.0 0.0 NaN ... 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 5.0 NaN 2.0
2123 1576.0 2596.0 1.0 0.0 7.0 0.0 0.0 0.0 0.0 NaN ... 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 5.0 NaN 2.0
2124 1576.0 3049.0 1.0 0.0 9.0 0.0 0.0 0.0 0.0 NaN ... 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 5.0 NaN 2.0
2125 2796.0 3415.0 1.0 1.0 5.0 0.0 0.0 0.0 0.0 NaN ... -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 NaN 1.0 NaN 1.0

5 rows × 46 columns

Now let's create a dataframe that only contains our input features and our desired classification. In the spreadsheet, we know our 21 features were labelled with the numbers 1 to 21, starting from the column LB and ending at the column Tendency. Let's display all the columns in our dataframe.

In [62]:
data.columns
Out[62]:
Index(['b', 'e', 'AC', 'FM', 'UC', 'DL', 'DS', 'DP', 'DR', 'Unnamed: 9', 'LB',
       'AC.1', 'FM.1', 'UC.1', 'DL.1', 'DS.1', 'DP.1', 'ASTV', 'MSTV', 'ALTV',
       'MLTV', 'Width', 'Min', 'Max', 'Nmax', 'Nzeros', 'Mode', 'Mean',
       'Median', 'Variance', 'Tendency', 'Unnamed: 31', 'A', 'B', 'C', 'D',
       'E', 'AD', 'DE', 'LD', 'FS', 'SUSP', 'Unnamed: 42', 'CLASS',
       'Unnamed: 44', 'NSP'],
      dtype='object')

There are many ways to reduce all of these columns down to the 21 we want. In this notebook, we're going to be explicit with our column selections to ensure no errors can be introduced from a change in the external spreadsheet data.

In [63]:
columns = ['LB', 'AC.1', 'FM.1', 'UC.1', 'DL.1', 'DS.1', 'DP.1', 'ASTV',
           'MSTV', 'ALTV', 'MLTV', 'Width', 'Min', 'Max', 'Nmax', 'Nzeros',
           'Mode', 'Mean', 'Median', 'Variance', 'Tendency']

features = data[columns]
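As an aside, pandas label-based slicing could select the same contiguous block in one expression, since the features run from LB to Tendency in the spreadsheet. A sketch on a toy frame assumed to share the CTG column order:

```python
import pandas as pd

# Toy frame standing in for the full CTG DataFrame (same column order assumed).
toy = pd.DataFrame(columns=['b', 'e', 'LB', 'AC.1', 'Tendency', 'CLASS', 'NSP'])

# Label-based slicing selects every column from 'LB' to 'Tendency' inclusive.
feature_slice = toy.loc[:, 'LB':'Tendency']
```

The explicit list above is more verbose, but it fails loudly if the external spreadsheet's columns ever change, which is why we prefer it here.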

Let's have a peek at our new dataframe.

In [64]:
features.head()
Out[64]:
LB AC.1 FM.1 UC.1 DL.1 DS.1 DP.1 ASTV MSTV ALTV ... Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency
0 120.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 73.0 0.5 43.0 ... 64.0 62.0 126.0 2.0 0.0 120.0 137.0 121.0 73.0 1.0
1 132.0 0.006380 0.0 0.006380 0.003190 0.0 0.0 17.0 2.1 0.0 ... 130.0 68.0 198.0 6.0 1.0 141.0 136.0 140.0 12.0 0.0
2 133.0 0.003322 0.0 0.008306 0.003322 0.0 0.0 16.0 2.1 0.0 ... 130.0 68.0 198.0 5.0 1.0 141.0 135.0 138.0 13.0 0.0
3 134.0 0.002561 0.0 0.007682 0.002561 0.0 0.0 16.0 2.4 0.0 ... 117.0 53.0 170.0 11.0 0.0 137.0 134.0 137.0 13.0 1.0
4 132.0 0.006515 0.0 0.008143 0.000000 0.0 0.0 16.0 2.4 0.0 ... 117.0 53.0 170.0 9.0 0.0 137.0 136.0 138.0 11.0 1.0

5 rows × 21 columns

Let's also create a new dataframe to store our labels, the NSP column, which contains our desired classifications.

In [65]:
labels = data['NSP']

Again, we'll check to see what we've created.

In [66]:
labels.head()
Out[66]:
0    2.0
1    1.0
2    1.0
3    1.0
4    1.0
Name: NSP, dtype: float64
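The labels come through as floats (2.0, 1.0, ...) because the raw sheet contained NaN rows. If integer class labels are preferred for counting and plotting, a cast does the job, sketched here on a toy Series:

```python
import pandas as pd

# Toy stand-in for the NSP label column; casting float labels to int
# gives tidier class values (2, 1, ...) once the NaN rows are gone.
labels_int = pd.Series([2.0, 1.0, 1.0], name='NSP').astype(int)
```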

Frequency of each Classification

Now let's see how many samples we have for each class.

In [67]:
value_counts = labels.value_counts()

fig = go.Figure()
fig.add_bar(x=value_counts.index, y=value_counts.values)
fig.show()

We can see that our dataset is imbalanced, with class 1 having a significantly higher number of samples. This may be a problem for us depending on our approach, so we may want to balance this dataset before continuing our study.
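To quantify the imbalance rather than eyeball it from the bar chart, value_counts(normalize=True) reports relative class frequencies. A minimal sketch on toy labels (the 8:1:1 split here is illustrative, not the real NSP distribution):

```python
import pandas as pd

# Toy label column with a class-1-heavy imbalance.
labels = pd.Series([1] * 8 + [2] * 1 + [3] * 1, name='NSP')

# Relative class frequencies make the imbalance explicit.
proportions = labels.value_counts(normalize=True)
```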

Oversampling

We can perform oversampling using the implementation of the Adaptive Synthetic (ADASYN) sampling approach in the imbalanced-learn library. We will pass in our features and labels variables, and expect outputs that have been resampled for a balanced class frequency.

In [68]:
features_resampled, labels_resampled = ADASYN().fit_resample(features, labels)

Let's see how many samples we have for each class this time.

In [69]:
labels_resampled = pd.DataFrame(labels_resampled, columns=['NSP'])
value_counts = labels_resampled['NSP'].value_counts()

fig = go.Figure()
fig.add_bar(x=value_counts.index, y=value_counts.values)
fig.show()

Much better!

Conclusion

In this section, we've used adaptive synthetic sampling to resample and balance our CTG dataset. The output is a balanced dataset; however, it's important to remember that these approaches should only be applied to training data, never to data that will be used for testing. For your own experiments, make sure you apply an approach like this only after you have split your dataset and locked away the test set.