Meltdown, Spectre and High Performance Computing
The Meltdown bug, which affects most modern CPUs, has been called by some 'the worst CPU bug ever'. Accessible explanations of what the Meltdown bug actually is are available here and here.
Software patches have been made available, but some people have estimated a performance hit of up to 30% in some cases. Some of us in the High Performance Computing (HPC) community (see here for the initial Twitter conversation) started to wonder what this might mean for the type of workloads that run on our systems. After all, if the worst-case scenario of 30% were the norm, it would drastically reduce the power of our systems and hence the amount of science we are able to support.
In the video below, Professor Mark Handley from University College London gives a detailed explanation of both Meltdown and Spectre at an event held at the Alan Turing Institute in London.
Another great introduction to this topic was given by Jon Masters in his FOSDEM 2018 closing keynote: https://fosdem.org/2018/schedule/event/closing_keynote/
To patch or not to patch
To a first approximation, a patch causing a 30% performance hit on a system costing £1 million is going to cost the equivalent of £300,000: not exactly small change! This has led to some people wondering if we should patch HPC systems at all:
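The back-of-the-envelope reasoning above can be written down explicitly (the 30% figure and £1 million system cost are the worst-case numbers under discussion, not measurements):

```python
def patch_cost(system_cost, performance_hit):
    """Effective capital cost of a performance hit: if the patched
    system delivers (1 - performance_hit) of its original throughput,
    the lost fraction of the investment is system_cost * performance_hit."""
    return system_cost * performance_hit

# Worst-case scenario discussed above: a 30% hit on a £1M system
print(patch_cost(1_000_000, 0.30))  # 300000.0
```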
Given the size of the performance hit should we even *be* patching for this? Unless you need trusted computing, does it really matter for the average HPC?
— Phil Tooley (@acceleratedsci) January 5, 2018
All of the UK Tier-3 HPC centres I'm aware of have applied the patches (Sheffield, Leeds and Manchester), but I'd be interested to learn of centres that decided not to. Feel free to comment here or message me on Twitter if you have something to add to this discussion and I'll update this post where appropriate.
Research paper discussing the performance penalties of these patches on HPC workloads
A group of researchers have written a paper on arXiv that looks at the HPC performance penalties in more detail. From the paper's abstract:
The results show that although some specific functions can have execution times decreased by as much as 74%, the majority of individual metrics indicates little to no decrease in performance. The real-world applications show a 2-3% decrease in performance for single node jobs and a 5-11% decrease for parallel multi node jobs.
Other relevant results and benchmarks
Here are a few other links that discuss the performance penalty of applying the Meltdown patch.
- Red Hat's experiments showing worst-case performance drops of 19% in certain benchmarks.
- A Wikipedia article summarising various benchmarks.
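Because the Meltdown mitigation (KPTI) adds overhead mainly on kernel entry and exit, the workloads hit hardest are syscall-heavy ones. A quick way to gauge the penalty on your own machine is to time a loop of cheap system calls with mitigations enabled versus disabled. Here is a minimal sketch using only the Python standard library; the choice of `os.getppid` as the representative syscall is my assumption, and any lightweight syscall would do:

```python
import os
import timeit

def syscall_benchmark(iterations=1_000_000):
    """Time a loop of cheap system calls and return the average cost
    per call in seconds. KPTI overhead shows up as a higher per-call
    cost, because every syscall now involves a page-table switch."""
    seconds = timeit.timeit(os.getppid, number=iterations)
    return seconds / iterations

per_call = syscall_benchmark(100_000)
print(f"{per_call * 1e9:.0f} ns per getppid() call")
```

Running this on a patched kernel and again after booting with the `pti=off` (or `nopti`) kernel parameter gives a rough measure of the Meltdown patch overhead for syscall-bound code on that particular CPU.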
Acknowledgements
Thanks to Adrian Jackson, Phil Tooley, Filippo Spiga and members of the UK HPC-SIG for useful discussions.
This bug made it clear that we must have alternatives to Intel CPUs.
In the HPC domain, that means having libraries like Intel MKL and Intel IPP for other capable CPUs.
OpenBLAS needs more attention from the community.
Once that happens, we have Ryzen as an alternative and, in the coming years, hopefully some ARM or RISC-V based CPUs for the HPC world.
By the way, when I say HPC I mean any workstation used for data analysis with Python, Julia, MATLAB, R, C, C++, etc.