Meltdown, Spectre and High Performance Computing

February 10th, 2018 | Categories: Cloud Computing, HPC, parallel programming | Tags:

The Meltdown bug which affects most modern CPUs has been called by some ‘The worst ever CPU bug’. Accessible explanations about what the Meltdown bug actually is are available here and here.

Software patches have been made available but some people have estimated a performance hit of up to 30% in some cases. Some of us in the High Performance Computing (HPC) community (See here for the initial twitter conversation) started to wonder what this might mean for the type of workloads that run on our systems. After all, if the worst case scenario of 30% is the norm, it will drastically affect the power of our systems and hence reduce the amount of science we are able to support.

In the video below, Professor Mark Handley from University College London gives a detailed explanation of both Meltdown and Spectre at an event held at Alan Turing Institute in London.

Another video that gives a great introduction to this topic was given by Jon Masters at https://fosdem.org/2018/schedule/event/closing_keynote/

To patch or not to patch

To a first approximation, a patch causing a 30% performance hit on a system costing £1 million pounds is going to cost an equivalent of £300,000 — not exactly small change! This has led to some people wondering if we should patch HPC systems at all:

Given the size of the performance hit should we even *be* patching for this? Unless you need trusted computing, does it really matter for the average HPC?

— Phil Tooley (@acceleratedsci) January 5, 2018

All of the UK Tier-3 HPC centres I’m aware of have applied the patches (Sheffield, Leeds and Manchester) but I’d be interested to learn of centres that decided not to. Feel free to comment here or message me on twitter if you have something to add to this discussion and I’ll update this post where appropriate.

Research paper discussing the performance penalties of these patches on HPC workloads

A group of people have written a paper on Arxiv that looks at HPC performance penalties in more detail. From the paper’s abstract:

The results show that although some specific functions can have execution times decreased by as much as 74%, the majority of individual metrics indicates little to no decrease in performance. The real-world applications show a 2-3% decrease in performance for single node jobs and a 5-11% decrease for parallel multi node jobs.

The full pdf is available at https://arxiv.org/abs/1801.04329

Other relevant results and benchmarks

Here are a few other links that discuss the performance penalty of applying the Meltdown patch.

Redhat’s experiments showing worst-case performance drops of 19% in certain benchmarks.
Wikipedia article showing a summary of various benchmarks.

Acknowledgements

Thanks to Adrian Jackson, Phil Tooley, Filippo Spiga and members of the UK HPC-SIG for useful discussions.

Walking Randomly