Sample Size Sufficiency


In [12]:
# used to create block diagrams
%reload_ext xdiag_magic
%xdiag_output_format svg
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to set shahin plot layout
import platypus as plat              # multi-objective optimisation framework
from scipy import stats

pio.templates['shahin'] = pio.to_templated(
    go.Figure().update_layout(
        legend=dict(orientation="h", y=1.1, x=.5, xanchor='center'),
        margin=dict(t=0, r=0, b=40, l=40)
    )
).layout.template
pio.templates.default = 'shahin'


Before conducting a comparison between algorithms, we need to determine whether our sample size is sufficient, i.e. is it large enough to support our hypothesis? One approach is to investigate the relationship between the sample size ($n$) and the Standard Error of the Mean ($SE_M$). The $SE_M$ is calculated by dividing the sample standard deviation by the square root of the number of samples under consideration. We calculate it for each sample size incrementally, until any further increase offers only trivial gains.
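As a minimal sketch of this calculation, the cell below computes the $SE_M$ by hand and compares it with `scipy.stats.sem`. The values in `sample` are hypothetical and used only for illustration; they are not results from the experiment in this section.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, used only for illustration
sample = np.array([0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94])

n = len(sample)
s = np.std(sample, ddof=1)      # sample standard deviation (ddof=1)
sem_manual = s / np.sqrt(n)     # SE_M = s / sqrt(n)
sem_scipy = stats.sem(sample)   # scipy.stats.sem also uses ddof=1 by default

print(sem_manual, sem_scipy)
```

Both values should agree, because `scipy.stats.sem` applies the same formula with `ddof=1` unless told otherwise.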

Let's try to determine the sufficient sample size for our experiment using this approach.

Executing an Experiment and Generating Results

In this section, we will be using the Platypus implementation of NSGA-II to generate solutions for the DTLZ1 test problem.

First, we will create a list named problems where each element is a DTLZ test problem that we want to use.

In [4]:
problems = [plat.DTLZ1]

Similarly, we will create a list named algorithms where each element is an algorithm that we want to compare.

In [5]:
algorithms = [plat.NSGAII]

Now we can execute an experiment, specifying the number of function evaluations, $nfe=10,000$, and the number of executions per problem, $seeds=250$. This may take some time to complete depending on your processor speed and the number of function evaluations.


Running the code below will take a long time to complete, even on good hardware. To put things into perspective, you will be executing an optimisation process 250 times for one algorithm on one test problem. That's 250 executions of 10,000 function evaluations each, totalling 2,500,000 function evaluations.

In [6]:
results = plat.experiment(algorithms, problems, nfe=10000, seeds=250)

Once the above execution has completed, we can initialize an instance of the hypervolume indicator provided by Platypus.

In [8]:
hyp = plat.Hypervolume(minimum=[0, 0, 0], maximum=[1, 1, 1])

Now we can use the calculate function provided by Platypus to calculate all our hypervolume indicator measurements for the results from our above experiment.

In [9]:
hyp_result = plat.calculate(results, hyp)

Finally, let's get the hypervolume indicator scores for all executions of NSGA-II on DTLZ1.

In [11]:
hyp_result = hyp_result['NSGAII']['DTLZ1']['Hypervolume']

Calculating and Plotting the Standard Error of the Mean

It may be tempting at this point to start generating results with other algorithms to begin our comparison. However, it's important to determine a sufficient sample size before moving on. One approach is to look at the relationship between the sample size and the Standard Error of the Mean ($SE_M$). The formula for this is $SE_M = \frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation and $n$ is the sample size.
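The formula also explains why gains become trivial as $n$ grows. As a rough worked example, assume the sample standard deviation $s$ stays approximately constant as more samples are added (this only holds approximately in practice). Quadrupling the sample size then gives:

$$\frac{SE_M(4n)}{SE_M(n)} = \frac{s/\sqrt{4n}}{s/\sqrt{n}} = \frac{\sqrt{n}}{\sqrt{4n}} = \frac{1}{2}$$

Under that assumption, halving the $SE_M$ requires four times as many samples, so each additional execution contributes less than the one before it.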

Let's use the sem function from scipy.stats to calculate the $SE_M$ for each sample size made possible by our experiment above. We will incrementally append these to a list so that we can plot them later.

In [38]:
SEM = []

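for sample_size in range(3,len(hyp_result)):
    # SE_M of the first sample_size hypervolume scores
    SEM.append(stats.sem(hyp_result[0:sample_size]))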

All that's left now is to plot the $SE_M$ for each incrementally ascending sample size.

In [57]:
fig = go.Figure(
    data=go.Scatter(x=list(range(3,len(hyp_result))), y=SEM),
    layout=dict(xaxis=dict(title='Sample Size'),yaxis=dict(title='Standard Error of the Mean'))
)
fig.show()

We may decide that a sufficient sample size can be selected when the $SE_M$ starts to settle below $0.05$. In this case, a sample size of $50$ can be justified.
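One way to make "starts to settle below" concrete is to look for the smallest sample size from which every later $SE_M$ stays under the chosen threshold. The helper below is a sketch of that idea, not part of Platypus or SciPy. The name `first_settled_size` and the hand-made values in `example_sem` are assumptions for illustration.

```python
import numpy as np

def first_settled_size(sem_values, sizes, threshold):
    """Return the smallest sample size from which every later SE_M
    stays below threshold, or None if the values never settle."""
    below = np.asarray(sem_values, dtype=float) < threshold
    # settled[i] is True when every value from index i onwards is below threshold
    settled = np.flip(np.logical_and.accumulate(np.flip(below)))
    if not settled.any():
        return None
    return sizes[int(np.argmax(settled))]

# Hand-made SE_M values (hypothetical), one per sample size
example_sizes = [3, 4, 5, 6, 7]
example_sem = [0.2, 0.04, 0.06, 0.045, 0.03]

settled_at = first_settled_size(example_sem, example_sizes, threshold=0.05)
print(settled_at)  # 6: the dip at size 4 is ignored because size 5 rises above 0.05 again
```

With the lists built earlier in this section, the equivalent call would be `first_settled_size(SEM, list(range(3, len(hyp_result))), 0.05)`. The threshold itself remains a judgment call that should be checked against the plot.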


In this section, we generated a large sample of results for a single algorithm on a single problem, and then calculated the standard error of the mean incrementally for all possible sample sizes. This has allowed us to determine a sample size that may be sufficient for the rest of our experiment. You will find different sample sizes used throughout the literature, but you will very rarely find a clear justification for the selection. Using this approach, you can increase your confidence in the number of samples in your experiments.
