This is the repo of the paper "On the Fair Comparison of Optimization Algorithms in Different Machines". In this paper, we introduce a methodology to assess the difference in performance of two optimization algorithms when they are executed on different machines.
Let's say we want to compare existing results of algorithm A, executed on machine M1 for time t₁, with another algorithm B. However, we cannot execute B on machine M1; instead, we need to execute it on machine M2, which might be faster or slower than machine M1. To address this, we can adjust the runtime of algorithm B on machine M2 such that the comparison is fair.
Requirements:
- All the experiments are assumed to be executed on the CPU in a single thread (no parallelization).
- The CPU scores of both machines are required; they can be looked up in the file cpu_scores.md. It is important to use the scores from this file, as PassMark scores change over time, and the regression was fitted with these specific values.
To calculate the equivalent runtime for algorithm B in machine M2, we run:
python3 equivalent_runtime.py 0.5 s₁ s₂ t₁
where s₁ s₂ are the CPU scores (cpu_scores.md) of M1 and M2 respectively, and t₁ is the execution time of A in machine M1.
So for example, if the machine score of M1 is 1219, the machine score of M2 is 1012 and the runtime in M1 was 10.0 seconds, then:
python3 equivalent_runtime.py 0.5 1219 1012 10.0
>> 11.032682
Notice how in this case, the equivalent runtime of 11.03 seconds for machine M2 is higher than the original runtime of 10.0 seconds. This is because M2 is slower (lower CPU score) than machine M1, and hence a longer runtime is required for executions on M2 to make the comparison fair.
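For intuition, a naive alternative would be to scale the runtime linearly with the ratio of CPU scores. This is not what equivalent_runtime.py does (its prediction comes from a fitted regression model, which is why it outputs 11.03 rather than 12.05), but the naive estimate is a useful sanity check of the direction of the adjustment:

```python
# Naive linear scaling of the runtime by the CPU-score ratio.
# NOTE: this is NOT the regression-based model implemented in
# equivalent_runtime.py; it is shown only for intuition.
s1 = 1219  # PassMark single thread score of M1
s2 = 1012  # PassMark single thread score of M2
t1 = 10.0  # runtime of algorithm A on M1, in seconds

t2_naive = t1 * s1 / s2  # the slower machine (lower score) gets more time
print(f"{t2_naive:.2f}")  # 12.05
```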
In the following, we show two examples of how to statistically compare the performance of two algorithms while considering results from different machines. This is more involved than the quick start guide, because we make more conservative predictions of the equivalent runtime, coupled with a corrected one-sided sign test. If the corrected sign test is not required, disregard these two examples and use the equivalent runtime as in the quick start guide. From now on, the optimization problem is assumed to be a minimization problem.
In this example, we will compare a simple random initialization local search procedure with a memetic search algorithm by Benlic et al.[1] for the QAP. Using the proposed methodology, we will statistically assess the difference in performance between these two algorithms, without having to actually execute the code of the memetic algorithm. In this case, the memetic search algorithm is algorithm A (the algorithm that is not executed), and the local search algorithm is algorithm B (the algorithm whose runtime is predicted). For this experiment, we choose a probability of predicting a longer than true equivalent runtime of p𝛾 = 0.01. To generate an unbiased prediction, simply use p𝛾 = 0.5.
Step 1: Obtaining the data
To apply the proposed methodology, we need to find certain information about the execution of the memetic algorithm: the list of instances to be used in the comparison, the average objective value obtained by the memetic search algorithm, and its runtime on each of the instances. We list this information, extracted from the article by Benlic et al.[1], in the table below. In addition, we need to find the CPU model of the machine on which the memetic search was run (machine M1), which is "Intel Xeon E5440 2.83GHz" as specified in their article. Finally, the machine score of this CPU, measured as the PassMark single thread score, is s₁ = 1219, which can be looked up in the file cpu_scores.md. It is important to use the PassMark single thread scores from December 2021 in this file, as PassMark scores change over time, and the methodology was fitted with these scores.
Instance | Runtime in seconds, t₁ | Objective value, aᵢ |
---|---|---|
tai40a | 486 | 3141222 |
tai50a | 2520 | 4945266 |
tai60a | 4050 | 7216339 |
tai80a | 3948 | 13556691 |
tai100a | 2646 | 21137728 |
tai50b | 72 | 458821517 |
tai60b | 312 | 608215054 |
tai80b | 1878 | 818415043 |
tai100b | 816 | 1185996137 |
tai150b | 4686 | 499195981 |
sko100a | 1338 | 115534 |
sko100b | 390 | 152002 |
sko100c | 720 | 147862 |
sko100d | 1254 | 149584 |
sko100e | 714 | 149150 |
sko100f | 1380 | 149036 |
Step 2: Predicting the equivalent runtime
With the data already gathered, we need to predict the equivalent runtime of each instance for the machine on which the local search algorithm will be executed (machine M2). To make the prediction, we need the machine score s₂ of this machine. The CPU model of M2 is "Intel Celeron N4100", with a PassMark single thread score of s₂ = 1012, which can be looked up in the file cpu_scores.md. As before, it is important to use the PassMark single thread scores from December 2021 in this file, since the methodology was fitted with these scores.
With this information, we are ready to predict the equivalent runtime in machine M2. We run the script
python3 equivalent_runtime.py 0.01 1219 1012 t₁
where t1 is substituted with the runtime of the memetic search algorithm in each instance, listed in the table above.
Step 3: Running the experiments
Now, we execute the local search algorithm on the instances listed in the table above, using the predicted runtimes t̂₂ as the stopping criterion. This execution is carried out on machine M2, and the best objective function values b̂ᵢ are listed in the table below. Following the procedure by Benlic et al.[1], these best objective values are averaged over 10 executions.
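The predicted runtime t̂₂ acts purely as a wall-clock budget for the local search. A minimal sketch of such a time-limited restart local search is shown below; evaluate, random_solution and neighbors are hypothetical problem-specific helpers, not the actual code used in these experiments:

```python
import random
import time

def restart_local_search(evaluate, random_solution, neighbors, budget_seconds):
    """Repeatedly run first-improvement local search from random starting
    points until the wall-clock budget (the predicted runtime) expires."""
    deadline = time.time() + budget_seconds
    best = None
    while time.time() < deadline:
        current = random_solution()
        improved = True
        while improved and time.time() < deadline:
            improved = False
            for candidate in neighbors(current):
                if evaluate(candidate) < evaluate(current):
                    current = candidate  # accept the first improving move
                    improved = True
                    break
        if best is None or evaluate(current) < evaluate(best):
            best = current
    return best

# Toy usage: minimize x^2 over the integers with a 0.2 second budget.
result = restart_local_search(
    evaluate=lambda x: x * x,
    random_solution=lambda: random.randint(-10, 10),
    neighbors=lambda x: [x - 1, x + 1],
    budget_seconds=0.2,
)
```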
Instance | Predicted runtime, t̂₂ | Objective value, b̂ᵢ |
---|---|---|
tai40a | 313.68 | 3207604 |
tai50a | 1626.50 | 5042830 |
tai60a | 2614.02 | 7393900 |
tai80a | 2548.18 | 13840668 |
tai100a | 1707.82 | 21611122 |
tai50b | 46.47 | 459986202 |
tai60b | 201.37 | 609946393 |
tai80b | 1212.13 | 824799510 |
tai100b | 526.67 | 1195646366 |
tai150b | 3024.52 | 505187740 |
sko100a | 863.59 | 153082 |
sko100b | 251.72 | 155218 |
sko100c | 464.71 | 149076 |
sko100d | 809.37 | 150568 |
sko100e | 460.84 | 150638 |
sko100f | 890.70 | 150006 |
Step 4: Obtaining the corrected p-value
Once we have all the results, we need to compute the statistic #{aᵢ < b̂ᵢ}, which counts the number of instances in which aᵢ < b̂ᵢ. In this case, aᵢ < b̂ᵢ holds in all 16 instances, and therefore #{aᵢ < b̂ᵢ} = 16, which equals the sample size. Now we can compute the corrected p-value.
python3 corrected_p_value.py 0.01 16 16
>> 1.0000000
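The statistic can be verified directly from the two tables above (a minimal check; the values below are copied from the tables in the same order):

```python
# a_i: average objective values of the memetic search (first table).
# b_i: average objective values of the local search (second table).
a = [3141222, 4945266, 7216339, 13556691, 21137728, 458821517, 608215054,
     818415043, 1185996137, 499195981, 115534, 152002, 147862, 149584,
     149150, 149036]
b = [3207604, 5042830, 7393900, 13840668, 21611122, 459986202, 609946393,
     824799510, 1195646366, 505187740, 153082, 155218, 149076, 150568,
     150638, 150006]
k = sum(ai < bi for ai, bi in zip(a, b))   # the statistic #{a_i < b_i}
n = sum(ai != bi for ai, bi in zip(a, b))  # sample size (ties excluded)
print(k, n)  # 16 16
```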
Step 5: Conclusion
Since the observed corrected p-value is not lower than or equal to α = 0.05, we cannot reject H0. In this case, the conclusion is that, with the amount of data that we have and the chosen target probability of type I error of 𝛼 = 0.05, we cannot say that the local search algorithm has a statistically significantly better performance than the memetic search algorithm. It would not be correct to conclude that the two algorithms perform statistically significantly the same, or that the memetic search performs statistically significantly better than the local search.
It is important to note that, if we had considered the original runtimes t₁ as the stopping criterion for algorithm B on machine M2 (longer than the estimated equivalent runtimes t̂₂), the local search would have had an unfairly longer runtime. In other words, the comparison would have been biased towards the local search.
In this second example, we will compare the same simple random initialization local search procedure with an estimation of distribution algorithm (EDA) for the QAP[2]. In this case, the EDA is algorithm A (the algorithm that is not executed), and the local search algorithm is algorithm B (the algorithm whose runtime is predicted). For this experiment, we choose a probability of predicting a longer than true equivalent runtime of p𝛾 = 0.01.
Step 1: Obtaining the data
To apply the proposed methodology, we need to find certain information about the execution of the EDA: the list of instances to be used in the comparison, the average objective value obtained by the EDA, and the runtime used on each instance. We list this information, extracted from the paper[2], in the table below. In addition, we need to find the CPU model of the machine on which the EDA was run (machine M1), which is "AMD Ryzen 7 1800X", as specified in the paper. Finally, the machine score of this CPU, measured as the PassMark single thread score, is s₁ = 2182, which can be looked up on the PassMark website.
Instance | Runtime in seconds, t₁ | Objective value, aᵢ |
---|---|---|
bur26a | 1.80 | 5432374 |
bur26b | 1.80 | 3824798 |
bur26c | 1.77 | 5427185 |
bur26d | 1.78 | 3821474 |
nug17 | 0.54 | 1735 |
nug18 | 0.63 | 1936 |
nug20 | 0.84 | 2573 |
nug21 | 0.95 | 2444 |
tai10a | 0.14 | 135028 |
tai10b | 0.14 | 1183760 |
tai12a | 0.22 | 224730 |
tai12b | 0.23 | 39464925 |
tai15a | 0.38 | 388910 |
tai15b | 0.38 | 51768918 |
tai20a | 0.85 | 709409 |
tai20b | 0.84 | 122538448 |
tai40a | 6.72 | 3194672 |
tai40b | 6.72 | 644054927 |
tai60a | 23.88 | 7367162 |
tai60b | 23.86 | 611215466 |
tai80a | 62.22 | 13792379 |
tai80b | 62.23 | 836702973 |
Step 2: Predicting the equivalent runtime
With the data already gathered, we need to predict the equivalent runtime of each instance for the machine in which the local search algorithm will be executed (machine M2). To make the prediction, we need the machine score s2 of this machine. The CPU model of M2 is "Intel Celeron N4100", with a PassMark single thread score of s2 = 1012. With this information, we are ready to predict the equivalent runtime in machine M2 by executing the following command
python3 equivalent_runtime.py 0.01 2182 1012 t₁
where t1 is substituted with the runtime of the EDA in each instance, listed in the table above.
Step 3: Running the experiments
Now, we execute the local search algorithm on the instances listed in the table above, using the predicted runtimes t̂₂ as the stopping criterion. This execution is carried out on machine M2, and the best objective function values b̂ᵢ are listed in the table below. Following the procedure by Arza et al.[2], these best objective values are averaged over 20 executions.
Instance | Predicted runtime, t̂₂ | Objective value, b̂ᵢ |
---|---|---|
bur26a | 1.57 | 5426670 |
bur26b | 1.57 | 3817852 |
bur26c | 1.55 | 5426795 |
bur26d | 1.56 | 3821239 |
nug17 | 0.48 | 1734 |
nug18 | 0.55 | 1936 |
nug20 | 0.74 | 2570 |
nug21 | 0.84 | 2444 |
tai10a | 0.13 | 135028 |
tai10b | 0.13 | 1183760 |
tai12a | 0.20 | 224416 |
tai12b | 0.21 | 39464925 |
tai15a | 0.34 | 388214 |
tai15b | 0.34 | 51765268 |
tai20a | 0.75 | 703482 |
tai20b | 0.74 | 122455319 |
tai40a | 5.87 | 3227894 |
tai40b | 5.87 | 637470334 |
tai60a | 20.86 | 7461354 |
tai60b | 20.84 | 611833935 |
tai80a | 54.35 | 13942804 |
tai80b | 54.36 | 830729983 |
Step 4: Obtaining the statistic and computing the corrected p-value
Once we have all the results, we need to compute the statistic #{aᵢ < b̂ᵢ}, which counts the number of instances in which aᵢ < b̂ᵢ. In this case, aᵢ < b̂ᵢ happens 4 times, and therefore, k = #{aᵢ < b̂ᵢ} = 4. Now we compute the corrected p-value, given that the sample size (the number of instances in which aᵢ ≠ b̂ᵢ) is n = 17 and the chosen probability of predicting a longer than true equivalent runtime is p𝛾 = 0.01.
python3 corrected_p_value.py 0.01 17 4
>> 0.033192784
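As a cross-check, the statistic, the sample size, and the standard (uncorrected) one-sided sign test p-value P(X ≤ k) with X ~ Binomial(n, 1/2) can be recomputed from the two tables above. The corrected p-value (0.0332) is larger than the uncorrected one, as expected, since the correction makes the test more conservative:

```python
from math import comb

# a_i: average objective values of the EDA (first table).
# b_i: average objective values of the local search (second table).
a = [5432374, 3824798, 5427185, 3821474, 1735, 1936, 2573, 2444, 135028,
     1183760, 224730, 39464925, 388910, 51768918, 709409, 122538448,
     3194672, 644054927, 7367162, 611215466, 13792379, 836702973]
b = [5426670, 3817852, 5426795, 3821239, 1734, 1936, 2570, 2444, 135028,
     1183760, 224416, 39464925, 388214, 51765268, 703482, 122455319,
     3227894, 637470334, 7461354, 611833935, 13942804, 830729983]
k = sum(ai < bi for ai, bi in zip(a, b))   # 4
n = sum(ai != bi for ai, bi in zip(a, b))  # 17 (the 5 ties are dropped)
# Standard one-sided sign test, without the correction for the
# uncertainty in the predicted runtime:
p_uncorrected = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
print(k, n, round(p_uncorrected, 4))  # 4 17 0.0245
```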
Step 5: Conclusion
Since the observed corrected p-value is lower than the chosen 𝛼 = 0.05, we reject H0. In this case, the conclusion is that, with a target probability of type I error of 𝛼 = 0.05, the performance of the local search procedure is statistically significantly better than the performance of the EDA.
In this case, machine M1 is more powerful (in terms of computational capabilities) than machine M2. If we had considered the original runtimes t₁ as the stopping criterion for algorithm B on machine M2 (shorter than the true equivalent runtime, since M2 is the slower machine), it would have been more difficult for the local search to perform better than the EDA. In that case, H0 might not have been rejected.
If you use this methodology in your paper, please consider citing it:
@article{arzaFairComparisonOptimization2024,
title = {On the {{Fair Comparison}} of {{Optimization Algorithms}} in {{Different Machines}}},
author = {Arza, Etor and Ceberio, Josu and Irurozki, Ekhi{\~n}e and P{\'e}rez, Aritz},
year = {2024},
month = mar,
journal = {Annals of Applied Statistics},
volume = {18},
number = {1},
issn = {1932-6157},
doi = {10.1214/23-AOAS1778}
}
[1] Benlic, U., & Hao, J.-K. (2015). Memetic search for the quadratic assignment problem. Expert Systems with Applications, 42(1), 584-595. https://doi.org/10.1016/j.eswa.2014.08.011
[2] Arza, E., Pérez, A., Irurozki, E., & Ceberio, J. (2020). Kernels of Mallows Models under the Hamming Distance for solving the Quadratic Assignment Problem. Swarm and Evolutionary Computation, 100740. https://doi.org/10.1016/j.swevo.2020.100740