Algorithm Performance and Statistical Significance

Preamble

In [24]:
# used to create block diagrams
%reload_ext xdiag_magic
%xdiag_output_format svg
    
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to set shahin plot layout
import platypus as plat              # multi-objective optimisation framework
from scipy import stats

pio.templates['shahin'] = pio.to_templated(
    go.Figure().update_layout(
        legend=dict(orientation="h", y=1.1, x=0.5, xanchor='center'),
        margin=dict(t=0, r=0, b=40, l=40))).layout.template
pio.templates.default = 'shahin'

Introduction

When preparing to implement multi-objective optimisation experiments, it's often more convenient to use a ready-made framework or library than to program everything from scratch. Many such libraries and frameworks exist across many programming languages, but with our focus on multi-objective optimisation, the choice is an easy one: we will use Platypus, which focuses on multi-objective problems and optimisation.

Platypus is a framework for evolutionary computing in Python with a focus on multiobjective evolutionary algorithms (MOEAs). It differs from existing optimization libraries, including PyGMO, Inspyred, DEAP, and Scipy, by providing optimization algorithms and analysis tools for multiobjective optimization.

In this section, we will use the Platypus framework to compare the performance of the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [1] and the Pareto Archived Evolution Strategy (PAES) [2]. To do this, we will use them to generate solutions to three problems in the ZDT test suite [3].

Because both of these algorithms are stochastic, meaning that they will produce different results every time they are executed, we will select a sufficient sample size of 30 per algorithm per test problem. We will also use the default configurations for all the test problems and algorithms employed in this comparison. We will use the Hypervolume Indicator (introduced in earlier sections) as our performance metric.

This time, we will also try to test the significance of our results.

Significance testing

Once we have our results, we will test the significance of our pairwise comparison. The significance test you select depends on the nature of your dataset and other criteria; for example, some practitioners select non-parametric tests when their datasets are not normally distributed. We will use the Wilcoxon signed-rank test through the scipy.stats.wilcoxon() function:

The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences x - y is symmetric about zero. It is a non-parametric version of the paired T-test.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html

This will give us some idea as to whether the results from one algorithm are significantly different from those from another algorithm.
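As a quick illustration of the function's interface, here is a minimal sketch using small made-up paired samples (not our experiment data), where one sample is consistently better than the other:

```python
from scipy import stats

# Two hypothetical paired samples, e.g. per-seed scores of two algorithms
scores_a = [0.99, 0.97, 0.98, 0.96, 0.99, 0.98, 0.97, 0.99]
scores_b = [0.98, 0.95, 0.95, 0.92, 0.94, 0.92, 0.90, 0.91]

# Tests the null hypothesis that the paired differences x - y
# are symmetric about zero
statistic, p_value = stats.wilcoxon(scores_a, scores_b)
print(statistic, p_value)
```

Because every difference here is positive, the test statistic is zero and the p-value is small, suggesting the two samples do not come from the same distribution.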

Executing an Experiment and Generating Results

In this section, we will use the Platypus implementations of NSGA-II and PAES to generate solutions for the ZDT1, ZDT2, and ZDT3 test problems.

First, we will create a list named problems where each element is a ZDT test problem that we want to use.

In [2]:
problems = [plat.ZDT1, plat.ZDT2, plat.ZDT3]

Similarly, we will create a list named algorithms where each element is an algorithm that we want to compare.

In [3]:
algorithms = [plat.NSGAII, plat.PAES]

Now we can execute an experiment, specifying the number of function evaluations, $nfe=5,000$, and the number of executions per problem, $seeds=30$. This may take some time to complete depending on your processor speed and the number of function evaluations.

Warning

Running the code below will take a long time to complete even if you have good hardware.

In [4]:
results = plat.experiment(algorithms, problems, nfe=5000, seeds=30)

Once the above execution has completed, we can initialise an instance of the hypervolume indicator provided by Platypus.

In [5]:
hyp = plat.Hypervolume(minimum=[0, 0], maximum=[11, 11])
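Platypus handles this calculation for us, but to build intuition for what the indicator measures, here is a minimal sketch of the two-objective minimisation case: the hypervolume is the area dominated by the front and bounded by a reference point. (This is an illustrative implementation, not the one Platypus uses.)

```python
def hypervolume_2d(front, reference):
    """Area dominated by a 2-D minimisation front, bounded by a reference point."""
    area = 0.0
    prev_f2 = reference[1]
    # Sort by the first objective; on a non-dominated front, f2 then decreases
    for f1, f2 in sorted(front):
        # Rectangle between this point and the reference point, excluding
        # the area already counted for points with smaller f1
        area += (reference[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return area

# Three hypothetical non-dominated points and a reference point
front = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, reference=(4.0, 5.0))
print(hv)
```

A larger dominated area means the front is closer to the true Pareto front and better spread, which is why a higher hypervolume indicates better performance.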

Now we can use the calculate function provided by Platypus to calculate all our hypervolume indicator measurements for the results from our above experiment.

In [6]:
hyp_result = plat.calculate(results, hyp)

Finally, we can display these results using the display function provided by Platypus.

In [7]:
plat.display(hyp_result, ndigits=3)
NSGAII
    ZDT1
        Hypervolume : [0.989, 0.978, 0.99, 0.988, 0.986, 0.989, 0.99, 0.99, 0.978, 0.98, 0.99, 0.988, 0.99, 0.992, 0.981, 0.991, 0.991, 0.989, 0.991, 0.99, 0.989, 0.991, 0.991, 0.993, 0.979, 0.972, 0.969, 0.991, 0.991, 0.986]
    ZDT2
        Hypervolume : [0.9, 0.97, 0.903, 0.902, 0.904, 0.899, 0.902, 0.897, 0.904, 0.9, 0.898, 0.897, 0.901, 0.903, 0.971, 0.899, 0.914, 0.974, 0.9, 0.895, 0.903, 0.91, 0.905, 0.902, 0.901, 0.896, 0.903, 0.895, 0.95, 0.896]
    ZDT3
        Hypervolume : [0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.998, 0.997, 0.998, 0.998, 0.998, 0.997, 0.998, 0.998, 0.997, 0.998, 0.998]
PAES
    ZDT1
        Hypervolume : [0.988, 0.978, 0.986, 0.968, 0.994, 0.992, 0.991, 0.985, 0.991, 0.97, 0.984, 0.973, 0.983, 0.98, 0.971, 0.98, 0.982, 0.993, 0.995, 0.978, 0.987, 0.986, 0.985, 0.976, 0.957, 0.991, 0.984, 0.994, 0.972, 0.962]
    ZDT2
        Hypervolume : [0.942, 0.948, 0.981, 0.971, 0.974, 0.99, 0.944, 0.981, 0.974, 0.973, 0.945, 0.931, 0.966, 0.966, 0.94, 0.96, 0.953, 0.932, 0.987, 0.963, 0.955, 0.95, 0.963, 0.968, 0.961, 0.938, 0.963, 0.979, 0.955, 0.982]
    ZDT3
        Hypervolume : [0.977, 0.998, 0.976, 0.963, 0.964, 0.991, 0.977, 0.974, 0.996, 0.998, 0.998, 0.976, 0.977, 0.998, 0.998, 0.998, 0.971, 0.998, 0.998, 0.98, 0.976, 0.998, 0.998, 0.977, 0.973, 0.993, 0.995, 0.982, 0.967, 0.998]

Statistical Comparison of the Hypervolume Results

Now that we have a data structure that has been populated with results from each execution of the algorithms, we can do a quick statistical comparison to give us some indication as to which algorithm (NSGA-II or PAES) performs better on each problem.

We can see in the output of display above that the data structure is organised as follows:

  • Algorithm name (e.g. NSGAII)
    • Problem name (e.g. ZDT1)
      • Performance metric (e.g. Hypervolume)
        • The score for each run (e.g. 30 individual scores).

As a quick test, let's try and get the hypervolume indicator score for the first execution of NSGA-II on ZDT1.

In [8]:
hyp_result['NSGAII']['ZDT1']['Hypervolume'][0]
Out[8]:
0.9893490511700844

To further demonstrate how this works, let's also get the hypervolume indicator score for the sixth execution of NSGA-II on ZDT1.

In [9]:
hyp_result['NSGAII']['ZDT1']['Hypervolume'][5]
Out[9]:
0.9892885520118945

Finally, let's get the hypervolume indicator scores for all executions of NSGA-II on ZDT1.

In [10]:
hyp_result['NSGAII']['ZDT1']['Hypervolume']
Out[10]:
[0.9893490511700844,
 0.9778672431726394,
 0.9897350009693344,
 0.9880313864593729,
 0.9857860250277802,
 0.9892885520118945,
 0.9897998951572315,
 0.989852404394868,
 0.9775241087124978,
 0.9802464188530361,
 0.9901561146485758,
 0.9880553924910729,
 0.9897601348125896,
 0.9915006560714865,
 0.9814783927333247,
 0.9906302511440038,
 0.9910227875349186,
 0.9893599401880443,
 0.9914591864141794,
 0.9901349122722433,
 0.989168833206244,
 0.9905198529787448,
 0.9910389181817458,
 0.9928166745260706,
 0.9790958590942261,
 0.9719785911389545,
 0.9689886368223938,
 0.9912932838088092,
 0.9912284964070069,
 0.986230368142299]

Perfect. Now we can use numpy to calculate the mean hypervolume indicator value for all of our executions of NSGA-II on ZDT1.

In [11]:
np.mean(hyp_result['NSGAII']['ZDT1']['Hypervolume'])
Out[11]:
0.9867799122848557

Let's do the same for PAES.

In [12]:
np.mean(hyp_result['PAES']['ZDT1']['Hypervolume'])
Out[12]:
0.9818640193005429

We can see that the mean hypervolume indicator value for NSGA-II on ZDT1 is higher than that of PAES on ZDT1. A higher hypervolume indicator value indicates better performance, so we can tentatively say that NSGA-II outperforms PAES on our configuration of ZDT1 according to the hypervolume indicator. Of course, we haven't yet determined whether this result is statistically significant.
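The mean alone can mask variability across the 30 executions, so it can also help to check the spread of the scores. A quick sketch using the first five NSGA-II and PAES scores on ZDT1 from the display output above:

```python
import numpy as np

# First five per-seed hypervolume scores on ZDT1, taken from the display above
nsgaii_scores = [0.989, 0.978, 0.990, 0.988, 0.986]
paes_scores   = [0.988, 0.978, 0.986, 0.968, 0.994]

# Report the mean together with the standard deviation for each algorithm
for name, scores in [('NSGAII', nsgaii_scores), ('PAES', paes_scores)]:
    print(name, np.mean(scores), np.std(scores))
```

Two algorithms with similar means but very different spreads can behave quite differently in practice, which is part of why we follow up with a significance test rather than comparing means alone.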

Let's create a DataFrame where each column refers to the mean hypervolume indicator values for the test problems ZDT1, ZDT2, and ZDT3, and each row represents the performance of an algorithm (in this case, NSGA-II and PAES).

In [13]:
df_hyp_results = pd.DataFrame(index = hyp_result.keys())

for key_algorithm, algorithm in hyp_result.items():
    for key_problem, problem in algorithm.items():
        # mean hypervolume over all 30 executions of this algorithm on this problem
        df_hyp_results.loc[key_algorithm, key_problem] = np.mean(problem['Hypervolume'])

df_hyp_results
Out[13]:
ZDT1 ZDT2 ZDT3
NSGAII 0.986780 0.909800 0.997900
PAES 0.981864 0.961167 0.985433
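An equivalent way to build this table in one step is a dictionary comprehension over the nested result structure. A sketch, using a small stand-in with the same layout as hyp_result:

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in with the same nested layout as hyp_result
example_result = {
    'NSGAII': {'ZDT1': {'Hypervolume': [0.99, 0.98, 0.99]},
               'ZDT2': {'Hypervolume': [0.90, 0.91, 0.89]}},
    'PAES':   {'ZDT1': {'Hypervolume': [0.97, 0.98, 0.96]},
               'ZDT2': {'Hypervolume': [0.96, 0.97, 0.95]}},
}

# Outer keys become columns, so transpose to get one row per algorithm
df = pd.DataFrame({
    algorithm: {problem: np.mean(result['Hypervolume'])
                for problem, result in problems.items()}
    for algorithm, problems in example_result.items()
}).transpose()
print(df)
```

This produces the same rows-as-algorithms, columns-as-problems layout without mutating the DataFrame inside a loop.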

Now we have an overview of how our selected algorithms performed on the selected test problems according to the hypervolume indicator. Personally, I find it easier to compare algorithm performance when each column represents a different algorithm rather than a problem.

In [26]:
df_hyp_results.transpose()
Out[26]:
NSGAII PAES
ZDT1 0.986780 0.981864
ZDT2 0.909800 0.961167
ZDT3 0.997900 0.985433

Without consideration for statistical significance, which algorithm performs best on each test problem?

Significance Testing

Now let's use the Wilcoxon signed-rank test that we introduced above to see if our results are significant, or if any difference in observation occurred purely by chance.

Before using the test, we need to decide on a value for alpha, our significance level. This is essentially the "risk" of concluding that a difference exists when it doesn't, e.g. an alpha of $0.05$ indicates a 5% risk. We can consider alpha to be a kind of threshold. This will be covered in more detail in another article, but for now we will set $0.05$ as our alpha. This means that if our p-value is less than $0.05$, we reject the null hypothesis, and the samples are likely not from the same distribution. Otherwise, we cannot reject the null hypothesis, and the samples are likely from the same distribution.
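This decision rule can be expressed as a tiny helper (a hypothetical convenience function, not part of scipy or Platypus):

```python
def is_significant(p_value, alpha=0.05):
    # Reject the null hypothesis when the p-value falls below alpha
    return p_value < alpha

print(is_significant(0.01))  # p below alpha: reject the null hypothesis
print(is_significant(0.2))   # p above alpha: cannot reject the null hypothesis
```

Note that failing to reject the null hypothesis is not proof that the two samples come from the same distribution; it only means we lack the evidence to say otherwise.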

We will use NSGA-II as our benchmark algorithm, meaning we will compare every other algorithm that we're considering to NSGA-II to determine if the results were significant. Let's write some code to determine this for us:

In [34]:
algorithms = ['NSGAII', 'PAES']
problems = ['ZDT1', 'ZDT2', 'ZDT3']

df_hyp_wilcoxon = pd.DataFrame(index = [algorithms[1]])

# compare PAES against our NSGA-II benchmark on each problem
for key_problem in problems:
    s, p = stats.wilcoxon(hyp_result[algorithms[0]][key_problem]['Hypervolume'],
                          hyp_result[algorithms[1]][key_problem]['Hypervolume'])
    df_hyp_wilcoxon.loc[algorithms[1],key_problem] = p
            
df_hyp_wilcoxon.transpose()
Out[34]:
PAES
ZDT1 0.015658
ZDT2 0.000006
ZDT3 0.000106

We can see that in every case, $p < 0.05$. Therefore, according to our chosen significance level $\alpha$, our results are statistically significant. Looking back at the mean hypervolume results:

In [35]:
df_hyp_results.transpose()
Out[35]:
NSGAII PAES
ZDT1 0.986780 0.981864
ZDT2 0.909800 0.961167
ZDT3 0.997900 0.985433

We can now say that with respect to the hypervolume indicator quality:

  • NSGA-II outperforms PAES on ZDT1.
  • PAES outperforms NSGA-II on ZDT2.
  • NSGA-II outperforms PAES on ZDT3.

The results are statistically significant.

Conclusion

In this section we have demonstrated how we can compare two popular multi-objective evolutionary algorithms on a selection of three test problems using the hypervolume indicator to measure their performance. In this case, we have also determined the significance of our results using the Wilcoxon signed-rank test.

Exercise

Create your own experiment, but this time include different algorithms and problems and determine which algorithm performs the best on each problem.


  1. Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. A. M. T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2), 182-197. 

  2. Knowles, J., & Corne, D. (1999, July). The pareto archived evolution strategy: A new baseline algorithm for pareto multiobjective optimisation. In Congress on Evolutionary Computation (CEC99) (Vol. 1, pp. 98-105). 

  3. Zitzler, E., Deb, K., & Thiele, L. (2000). Comparison of multiobjective evolutionary algorithms: Empirical results. Evolutionary computation, 8(2), 173-195.