From 92a156567809d600266a68fb368d08d49a1e1f9e Mon Sep 17 00:00:00 2001
From: Didier SPEZIA
Date: Wed, 6 Jan 2021 17:43:02 +0100
Subject: [PATCH] Updated documentation following Slavko's review

---
 Readme.md | 52 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/Readme.md b/Readme.md
index 94d166a..5080de3 100644
--- a/Readme.md
+++ b/Readme.md
@@ -2,7 +2,9 @@
 
 ## Purpose
 
-cpubench1a is a CPU benchmark whose purpose is to measure the global computing power of a Linux machine. It is used at Amadeus to qualify bare-metal and virtual boxes, and compare generations of machines or VMs. It runs a number of throughput oriented tests to establish:
+cpubench1a is a CPU benchmark whose purpose is to measure the global computing power of a Linux machine. It is used at [Amadeus](https://www.amadeus.com) to qualify bare-metal and virtual boxes, and to compare generations of machines or VMs. It is delivered as static, self-contained Go binaries for the x86_64 and Aarch64 CPU architectures.
+
+It runs a number of throughput-oriented tests to establish:
 
 - an estimation of the throughput a single OS processor can sustain (i.e. a single-threaded benchmark)
 - an estimation of the throughput all the OS processors can sustain (i.e. a multi-threaded benchmark)
@@ -11,6 +13,8 @@ An OS processor is defined as an entry in the /proc/cpuinfo file. Depending on t
 
 ## Build and launch
 
+It is recommended to build the binaries on a reference machine (possibly different from the machines to be compared by the benchmark). The idea is to use the same binaries on all the machines involved in the benchmark, to make sure the exact same code runs everywhere. We suggest using any Linux x86_64 box supporting the Go toolchain (there is no specific constraint here), and building the ARM binary by cross compilation.
+
 To build from source (for x86_64, from a x86_64 box):
 
 ```
@@ -23,26 +27,24 @@ To build from source (for Aarch64, from a x86_64 box):
 GOOS=linux GOARCH=arm64 go build
 ```
 
-It recommended to build on a reference machine, which is different from the machines to be compared by the benchmark.
-
-The resulting binary is statically linked and not dependent on the Linux distribution or version. In order to run the benchmark, the binary can be copied directly to the target machines, and run there.
+The resulting binary is statically linked and not dependent on the Linux distribution or version (except for a minimal 2.6.23 kernel version). In order to run the benchmark, the binary can be directly copied to the target machines, and run there.
 
 The command line parameters are:
 
 ```
 Usage of ./cpubench1a:
   -bench
-        Run standard benchmark
+        Run standard benchmark (multiple iterations). Mutually exclusive with -run
   -duration int
-        Duration in seconds (default 60)
+        Duration in seconds of a single iteration (default 60)
   -nb int
         Number of iterations (default 5)
   -run
-        Run a single benchmark iteration
+        Run a single benchmark iteration. Mutually exclusive with -bench
   -threads int
-        Number of threads (default -1)
+        Number of Go threads (i.e. GOMAXPROCS). Default is all OS processors (default -1)
   -workers int
-        Number of workers (default -1)
+        Number of workers. Default is 4*threads (default -1)
 ```
 
 The canonical way to launch the benchmark is just:
@@ -61,12 +63,14 @@ The principle is very similar to SPECint or Coremark integer benchmarks. It is b
 
 - deflate compression/decompression + base64 encoding/decoding
 - sorting a set of records in multiple orders
-- an awk intepreter (parsing and execution of awk programs)
-- JSON parsing and encoding
+- parsing and executing short awk programs
+- parsing and encoding JSON messages
 - building/using some btree data structures
 - a Monte-Carlo simulation calculating the availability of a NoSQL cluster
 
-These algorithms are not specifically representative of a given Amadeus application or functional transaction. Compression/decompression, encoding/decoding, data structures management, sorting small datasets are typical of back-end software though. The relative execution time of the various algorithms can be checked using:
+These algorithms are not specifically representative of a given Amadeus application or functional transaction. Compression/decompression, encoding/decoding, data structure management, and sorting of small datasets are typical of back-end software though.
+
+To check that the benchmark is relevant (and that the execution time of one algorithm does not dwarf all the others), the relative execution time of the various algorithms can be displayed using:
 
 ```
 $ go test -bench=.
@@ -84,31 +88,35 @@
 PASS
 ok cpubench1a 8.931s
 ```
 
-There is a main driver and multiple workers. The driver is pushing transactions to a queue Each worker listens and fetch transactions from the queue, and execute them. Each transaction executes the above algorithms (all of them). The implementation of these algorithms has been designed to be independent from the context (i.e. reentrant, no contention on shared data). The queuing/dequeuing overhead is negligible compared to the transaction execution time. The queue is saturated for all the benchmark duration except at the end, so there is no wait state in the workers.
+The architecture of the benchmark program is the following: there is a main driver and multiple workers. The driver pushes transactions to a queue. Each worker fetches transactions from the queue and executes them. Each transaction executes all of the above algorithms. The implementation of these algorithms has been designed to be independent from the context (i.e. reentrant, with no contention on shared data) and CPU bound. The queuing/dequeuing overhead is negligible compared to the transaction execution time. The queue is saturated for the whole benchmark duration except at the very end, so there is no wait state in the workers.
 
-A single-threaded run only involves a single worker. A multi-threaded run involves as many workers as OS processors (by default).
+A single-threaded run only involves a single worker. A multi-threaded run involves 4 workers per OS processor (by default).
 
-Each test runs for a given duration (typically 1 minute). The score of the benchmark is simply the number of transaction executions per second.
+Each test iteration runs for a given duration (typically 1 minute). The score of the benchmark is simply the number of transaction executions per second.
 
-A normal benchmark run involves 5 tests in single-threaded mode, and 5 tests in multi-threaded mode. The resulting score is defined as the **maximum** reported throughput in each category.
+A normal benchmark run involves 5 test iterations in single-threaded mode, and 5 test iterations in multi-threaded mode. The resulting score is defined as the **maximum** reported throughput in each category. Taking the maximum aims to counter the negative effects of CPU throttling, variable frequencies and noisy neighbours in some environments.
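+
+As a rough illustration of the driver/worker structure described above, here is a minimal sketch. It is not the actual cpubench1a code: the runAlgorithms function, the queue size, and the hard-coded worker count and duration are made up for the example.
+
+```
+package main
+
+import (
+	"fmt"
+	"sync"
+	"sync/atomic"
+	"time"
+)
+
+// runAlgorithms stands for one transaction: the real benchmark would run all
+// the algorithms listed above (compression, sorting, awk, JSON, btree, ...).
+func runAlgorithms() {
+	// CPU-bound work only, no shared state between workers.
+}
+
+func main() {
+	const workers = 4 // the real default is 4 workers per Go thread
+	const duration = 10 * time.Second
+
+	queue := make(chan struct{}, 1024)
+	var executed int64
+	var wg sync.WaitGroup
+
+	// Workers fetch transactions from the queue and execute them.
+	for i := 0; i < workers; i++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			for range queue {
+				runAlgorithms()
+				atomic.AddInt64(&executed, 1)
+			}
+		}()
+	}
+
+	// The driver keeps the queue saturated until the deadline, then stops.
+	deadline := time.Now().Add(duration)
+	for time.Now().Before(deadline) {
+		queue <- struct{}{}
+	}
+	close(queue)
+	wg.Wait()
+
+	fmt.Printf("throughput: %.1f transactions/s\n",
+		float64(atomic.LoadInt64(&executed))/duration.Seconds())
+}
+```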
 
 ## Rationale: avoiding pitfalls
 
 We have decided to write our own benchmark to avoid the following issues:
 
-- we wanted to test CPUs, and not operating systems, compilers, system libraries, runtime, etc ...
+- we wanted to test CPUs only, and not the operating system, compilers, system libraries, runtime, etc. of the tested platform. Of course, we still depend on the Go compiler/runtime, but it is supposed to be the same compiler/runtime for all the tests.
 
-- we wanted to compare various physical or virtual on-prem machines, but also public cloud boxes. Some machines use a fixed CPU frequency. Some others use a variable CPU frequency and the guest operating system is aware. Some others use a variable CPU frequency at hypervisor level, but the guest operating system is not aware. Most CPUs adapt the frequency to their workload.
+- we wanted to compare various physical or virtual on-premises machines, but also public cloud boxes. Some machines use a fixed CPU frequency. Others use a variable CPU frequency that the guest operating system is aware of. Still others use a variable CPU frequency at the hypervisor level, which the guest operating system is not aware of. Most CPUs adapt their frequency to the workload. By design, this benchmark presents meaningful figures whatever the situation.
 
 - we wanted to mitigate the risk of having a vendor optimizing its offer to shine in a well-known industry benchmark.
 
-The benchmark is therefore coded in pure Go, compiled with a specific Go version on a given reference machine, and linked statically. The binaries are provided for Intel/AMD and ARM (64 bits). They are copied to the machine we want to test, so we are sure the same code is executed whatever the operating system distribution/version/flavor.
+The benchmark is therefore coded in pure Go, compiled with a specific Go version on a given reference machine, and linked statically. The binaries are provided for Intel/AMD and ARM (64 bits). They are copied to the machine we want to test, so we are sure the same code is executed whatever the Linux distribution/version/flavor. At the moment, the benchmark only runs under Linux, and no other operating system.
+
+Limiting all memory allocations while testing real-world code is difficult. Each algorithm generates some memory allocations, and therefore some garbage collection activity. We have just ensured that the garbage collection cost is low compared to the runtime cost of the algorithms. Furthermore, each test iteration runs in a separate process, so that the memory accumulated by a given iteration does not degrade the garbage collection of the next iteration.
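+
+To illustrate this per-iteration process isolation, a driver process can simply re-launch its own binary once per iteration in single-iteration mode. The sketch below is hypothetical: it reuses the -run and -duration flags documented above, but it is not necessarily how cpubench1a implements it.
+
+```
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/exec"
+)
+
+func main() {
+	const iterations = 5
+	for i := 0; i < iterations; i++ {
+		// Each iteration runs in a fresh process, so the memory accumulated
+		// by one iteration cannot affect the garbage collection of the next.
+		cmd := exec.Command(os.Args[0], "-run", "-duration", "60")
+		cmd.Stdout = os.Stdout
+		cmd.Stderr = os.Stderr
+		if err := cmd.Run(); err != nil {
+			fmt.Fprintln(os.Stderr, "iteration", i, "failed:", err)
+			os.Exit(1)
+		}
+	}
+}
+```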
 
-Limiting all memory allocations while testing real-world code is difficult. Each algorithm generates some memory allocations, and therefore some garbage collection activity. We have just ensured that the garbage collection cost is low compared to the runtime cost of the algorithms.
+The variability of the results, especially in non-isolated environments (clouds), was a concern. We have not found a better mitigation mechanism than running multiple iterations of the same test and keeping the maximum score. The benchmark is known to be sensitive to:
 
-The variability of the results especially on non isolated environments (clouds) is a concern.
+- noisy neighbors running on the same hypervisor
+- CPU throttling and P-state/C-state configuration, at guest OS or hypervisor level
+- the NUMA configuration, and how the threads are distributed over the NUMA nodes
 
-The benchmark measures throughput on single-threaded and multi-threaded code so that we have a score bound by the maximum frequency a CPU core can get (for single-threaded tests), and the maximum frequency **all** the cores can get (for multi-threaded tests) - which can be different.
+The benchmark measures throughput on single-threaded and multi-threaded code, so that we have a score bound by the maximum frequency a CPU core can get (for single-threaded tests), and the maximum frequency **all** the cores can get (for multi-threaded tests) - which can be different. Each test is run multiple times (we suggest 5 times as a minimum), so that the system has time to reach the maximum possible frequency, and to mitigate performance variability and noisy neighbour effects. Each test runs in a separate process and starts from the same memory state, to avoid impacts due to the non-deterministic nature of garbage collection. The more runs, the better the accuracy of the result.
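+
+As a small worked illustration of the scoring described above: each iteration reports a throughput in transactions per second, and the final score of a category (single-threaded or multi-threaded) is the maximum over its iterations. The helper below is hypothetical, not the project's actual code, and the figures are made up.
+
+```
+package main
+
+import "fmt"
+
+// finalScore returns the best throughput (transactions per second) observed
+// over all iterations of one category, as described above.
+func finalScore(iterations []float64) float64 {
+	best := 0.0
+	for _, s := range iterations {
+		if s > best {
+			best = s
+		}
+	}
+	return best
+}
+
+func main() {
+	// Made-up figures for 5 single-threaded iterations of 60 seconds each.
+	fmt.Println(finalScore([]float64{91.2, 93.5, 94.1, 93.8, 92.7}))
+}
+```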