Meltdown, Spectre and High Performance Computing

February 10th, 2018 | Categories: Cloud Computing, HPC, parallel programming

The Meltdown bug, which affects most modern CPUs, has been called by some ‘The worst ever CPU bug’. Accessible explanations of what the Meltdown bug actually is are available here and here.

Software patches have been made available, but early estimates put the performance hit at up to 30% in some cases. Some of us in the High Performance Computing (HPC) community (see here for the initial Twitter conversation) started to wonder what this might mean for the type of workloads that run on our systems. After all, if the worst-case scenario of 30% were the norm, it would drastically reduce the power of our systems and hence the amount of science we are able to support.

In the video below, Professor Mark Handley from University College London gives a detailed explanation of both Meltdown and Spectre at an event held at the Alan Turing Institute in London.

Jon Masters also gave a great introduction to this topic in his FOSDEM 2018 closing keynote: https://fosdem.org/2018/schedule/event/closing_keynote/

To patch or not to patch

To a first approximation, a patch causing a 30% performance hit on a system costing £1 million is going to cost the equivalent of £300,000, which is not exactly small change! This has led some people to wonder whether we should patch HPC systems at all:
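As a back-of-the-envelope check on that arithmetic, here is a trivial sketch. The £1 million system cost and 30% hit are the worst-case figures quoted above, not measurements:

```python
# Back-of-the-envelope: value of compute lost to a fractional slowdown.
# The figures are the worst-case numbers discussed above, not measurements.

def slowdown_cost(system_cost, performance_hit):
    """Cost-equivalent of losing a fraction of a system's throughput."""
    return system_cost * performance_hit

print(f"£{slowdown_cost(1_000_000, 0.30):,.0f}")  # prints £300,000
```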

All of the UK Tier-3 HPC centres I’m aware of (Sheffield, Leeds and Manchester) have applied the patches, but I’d be interested to learn of centres that decided not to. Feel free to comment here or message me on Twitter if you have something to add to this discussion and I’ll update this post where appropriate.
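If you want to check whether a given Linux machine reports the mitigations as applied, kernels from 4.15 onwards expose their status in sysfs. A minimal sketch (assuming that interface is present; on older kernels the directory simply won't exist):

```python
# Sketch: report the kernel's view of Meltdown/Spectre mitigations.
# Assumes Linux >= 4.15, which exposes these files in sysfs; on older
# kernels the directory does not exist, so we say so rather than guess.
from pathlib import Path

vuln_dir = Path("/sys/devices/system/cpu/vulnerabilities")
if vuln_dir.is_dir():
    for entry in sorted(vuln_dir.iterdir()):
        print(f"{entry.name}: {entry.read_text().strip()}")
else:
    print("Kernel predates the sysfs vulnerability interface.")
```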

Research paper discussing the performance penalties of these patches on HPC workloads

A group of researchers has written a paper, available on arXiv, that looks at the HPC performance penalties in more detail. From the paper’s abstract:

The results show that although some specific functions can have execution times increased by as much as 74%, the majority of individual metrics indicates little to no decrease in performance. The real-world applications show a 2-3% decrease in performance for single node jobs and a 5-11% decrease for parallel multi node jobs.

The full paper is available at https://arxiv.org/abs/1801.04329
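One way to build intuition for why the penalty varies so much between workloads: the Meltdown mitigation (kernel page-table isolation) adds a cost to every user/kernel transition, so syscall- and communication-heavy code, such as multi-node MPI jobs, pays far more than pure floating-point compute. Here is a minimal, illustrative micro-benchmark contrasting the two kinds of loop; absolute timings are machine- and kernel-specific, and you would need to run it before and after patching to see the difference on your own system:

```python
# Illustrative micro-benchmark: syscall-bound vs compute-bound loops.
# KPTI adds overhead per user/kernel transition, so the first loop is
# expected to slow down far more after patching than the second.
import os
import timeit

N = 100_000

# Syscall-heavy: one read() system call per iteration.
fd = os.open("/dev/zero", os.O_RDONLY)
syscall_time = timeit.timeit(lambda: os.read(fd, 1), number=N)
os.close(fd)

# Compute-heavy: pure user-space arithmetic, no kernel transitions.
compute_time = timeit.timeit(lambda: sum(i * i for i in range(100)), number=N)

print(f"syscall loop: {syscall_time:.3f}s  compute loop: {compute_time:.3f}s")
```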


Other relevant results and benchmarks

Here are a few other links that discuss the performance penalty of applying the Meltdown patch.

Acknowledgements

Thanks to Adrian Jackson, Phil Tooley, Filippo Spiga and members of the UK HPC-SIG for useful discussions.

  1. Royi
    February 10th, 2018 at 19:09

This bug made it clear that we must have alternatives to Intel CPUs.

In the HPC domain, that means having libraries like Intel MKL and Intel IPP available for other capable CPUs.

    OpenBLAS needs more and more attention from the community.

Once that happens we will have Ryzen as an alternative and, in the following years, hopefully some ARM / RISC-V based CPUs for the HPC world.

By the way, when I say HPC I mean any workstation used for data analysis with Python / Julia / MATLAB / R / C / C++, etc.
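As an aside for readers weighing MKL against OpenBLAS: NumPy's public show_config() function reports which BLAS/LAPACK implementation a given build links against, so you can check what your own workstation is using (the output format varies between NumPy versions):

```python
# Print the BLAS/LAPACK libraries this NumPy build was compiled against
# (e.g. MKL or OpenBLAS). Output format varies across NumPy versions.
import numpy as np

np.show_config()
```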