Securing reliable and efficient autonomous functioning of supercomputers: basic principles and system prototype

Authors

  • Aleksandr Sergeevich Antonov
  • Vadim Vladimirovich Voevodin
  • Vladimir Valentinovich Voevodin
  • Sergey Anatolevich Zhumatiy
  • Dmitriy Aleksandrovich Nikitenko
  • Sergey Igorevich Sobolev
  • Konstantin Sergeevich Stefanov
  • Pavel Artemovich Shvec

Keywords:

supercomputer; supercomputer reliability; supercomputer fault-tolerance; supercomputer monitoring; supercomputer damages; supercomputer failures; supercomputer autonomous functioning; supercomputer model of functioning.

Abstract

A state-of-the-art supercomputer is an extremely complex, expensive, and energy-intensive system. Every component of a supercomputer is unreliable and can fail at any time. At RCC MSU, we are developing a system that aims to eliminate the adverse consequences of hardware and software failures and to ensure reliable and efficient autonomous functioning of supercomputers. The system is based on a model of the supercomputer represented as a multigraph.

Published

2018-10-23
