Exploring Functional OpenMP Performance Across Arm Based Platforms

In this work we discuss the usefulness of OpenMP across several readily available Arm HPC architectures. Two well-known benchmark suites are considered: the NAS Parallel Benchmarks and the EPCC OpenMP micro-benchmark suite.
The NAS Parallel Benchmarks (https://www.nas.nasa.gov/publications/npb.html) were designed to help evaluate both MPI and OpenMP parallel performance. They consist of eight benchmarks (five kernels and three pseudo-applications), representative of the computations commonly used in computational fluid dynamics applications. Between them they cover memory access, Conjugate Gradient, Multi-Grid, Discrete Fourier Transforms and linear solvers.
The EPCC OpenMP micro-benchmark suite (https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmp-micro-benchmark-suite) is intended to measure the overheads of synchronisation, loop scheduling and array operations in the OpenMP runtime library. As such it consists of four kernels: arraybench, schedbench, syncbench and taskbench.
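To make concrete what "overhead" means in this context, the following is a minimal C sketch of the general idea as we understand it: time a fixed amount of work with and without an OpenMP construct and take the difference. It is not code from the EPCC suite. The function names (`delay`, `reference_time`, `parallel_region_time`, `overhead_estimate`) and the choice of a parallel region as the construct under test are assumptions made for illustration.

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Seconds from a timer: wall-clock under OpenMP, CPU time otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical busy-work routine: a fixed amount of arithmetic per call. */
static void delay(int length)
{
    volatile double a = 0.0;
    for (int i = 0; i < length; i++)
        a += (double)i;
}

/* Mean time per repetition of the work with no OpenMP construct. */
double reference_time(int reps, int delay_len)
{
    double start = now_seconds();
    for (int r = 0; r < reps; r++)
        delay(delay_len);
    return (now_seconds() - start) / reps;
}

/* Mean time per repetition when each repetition opens a parallel region
   in which every thread performs the same work. */
double parallel_region_time(int reps, int delay_len)
{
    double start = now_seconds();
    for (int r = 0; r < reps; r++) {
#pragma omp parallel
        {
            delay(delay_len);
        }
    }
    return (now_seconds() - start) / reps;
}

/* Overhead estimate: extra time attributable to the construct itself. */
double overhead_estimate(double test_time, double ref_time)
{
    return test_time - ref_time;
}
```

A small driver could call `overhead_estimate(parallel_region_time(1000, 500), reference_time(1000, 500))`, with the program built with OpenMP enabled (for example `-fopenmp` with GCC). The repetition count and delay length here are arbitrary. Without OpenMP the pragma is ignored and the estimate is only timer noise, which serves as a sanity check on the method.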
By choosing these benchmarks we can demonstrate the effectiveness of OpenMP run-times for HPC applications across several Arm-based HPC architectures, including the Fujitsu A64FX, the Marvell® ThunderX2® and the Neoverse N1-based AWS Graviton2.
The Arm HPC ecosystem has matured significantly over the last few years, and this includes the number of available compilers and their associated OpenMP run-time libraries. For this work, we consider GCC, the Arm Compiler for Linux (ACfL), the NVIDIA HPC Software Development Kit (SDK) and the HPE/Cray Compiling Environment (CCE).
For the main results, we discuss performance for the OpenMP-only variant of the NAS Parallel Benchmarks across the different architectures. In particular, we investigate the effectiveness of the compilers and their OpenMP run-times. Further investigation is aided by the EPCC OpenMP micro-benchmarks, which allow a closer look at thread-level behaviour.
On the Marvell® ThunderX2® (Figure 1), using 64 OpenMP threads (pinned to the physical cores in node order), we compare three compilers: NVHPC, GCC and ACfL, since they give the most interesting comparison. It is clear that there is no overall winner: GCC wins four (CG.C, EP.C, IS.C, LU.C), ACfL wins three (BT.C.BLK, FT.C, MG.C) and NVHPC wins one (SP.C.BLK). The performance differences are not solely down to the OpenMP run-times; other compiler optimizations help to varying degrees from benchmark to benchmark. However, the compiler’s ability to generate optimized code within OpenMP regions does play an important role.
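The abstract does not say how the threads were pinned. One common, portable way to request one thread per physical core is the pair of standard OpenMP environment variables `OMP_PLACES=cores` and `OMP_PROC_BIND=close`; treating that as the mechanism here is purely an assumption. The C sketch below shows how the resulting placement can be checked from inside a program, using the OpenMP 4.5 routine `omp_get_place_num()`. The helper names `record_thread_places` and `print_thread_places` are hypothetical.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Store in place_of_thread[t] the OpenMP place number that thread t runs on,
   or -1 if it is unbound or the information is unavailable.
   Returns the number of threads in the team. */
int record_thread_places(int *place_of_thread, int capacity)
{
    int team_size = 1;
    for (int i = 0; i < capacity; i++)
        place_of_thread[i] = -1;

#if defined(_OPENMP) && _OPENMP >= 201511
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tid == 0)
            team_size = omp_get_num_threads();  /* written by one thread only */
        if (tid < capacity)
            place_of_thread[tid] = omp_get_place_num();
    }
#endif
    return team_size;
}

/* Print one line per thread so the mapping can be inspected by eye. */
void print_thread_places(const int *place_of_thread, int team_size, int capacity)
{
    for (int t = 0; t < team_size && t < capacity; t++)
        printf("thread %d -> place %d\n", t, place_of_thread[t]);
}
```

Under the assumed settings and `OMP_NUM_THREADS=64`, one would expect each thread to report a distinct place. A run where every place is `-1` suggests that binding is not in effect, in which case timings may vary from run to run.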
On the Fujitsu FX700 (HPE/Cray Apollo 80) (Figure 2), this time using 48 OpenMP threads, we compare three compilers: ACfL, GCC and CCE. The first noticeable difference from the previous results is that BT.C.BLK, EP.C, LU.C and SP.C.BLK exhibit much lower Mops/s than on the ThunderX2, whereas CG.C, FT.C, IS.C and MG.C are better (this is mostly due to the increased memory bandwidth which can be utilized by each OpenMP thread). This time GCC wins five (BT.C.BLK, IS.C, LU.C, MG.C, SP.C.BLK) and CCE wins the remaining three (CG.C, EP.C, FT.C). Again, a compiler’s ability to generate optimized code within OpenMP regions plays an important role.
The overall aim of this presentation is to guide the choice of compiler and Arm-based HPC platform for users who have a particular algorithm in mind that has already been parallelized with OpenMP.