Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers
S. Ghiasvand, F. Ciorba, R. Tschuter, and W. Nagel. Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pages 377--381. Heraklion, Crete, Greece, IEEE (February 2016)
DOI: 10.1109/PDP.2016.101
Abstract
In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We draw possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. The results of this study are aimed at helping the system administrators minimize (or even prevent) the destructive effects of correlated node failures.
booktitle: Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)
year: 2016
month: feb
pages: 377--381
publisher: IEEE
copyright: All rights reserved
isbn: 978-1-4673-8776-7
language: en
file: Ghiasvand et al. - 2016 - Lessons Learned from Spatial and Temporal Correlat.pdf:D\:\\Documents\\Zotero\\storage\\4UL3UN97\\Ghiasvand et al. - 2016 - Lessons Learned from Spatial and Temporal Correlat.pdf:application/pdf
%0 Conference Paper
%1 ghiasvand2016lessons
%A Ghiasvand, Siavash
%A Ciorba, Florina M.
%A Tschuter, Ronny
%A Nagel, Wolfgang E.
%B Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)
%C Heraklion, Crete, Greece
%D 2016
%I IEEE
%K myOwn from:ghiasvan
%P 377--381
%R 10.1109/PDP.2016.101
%T Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers
%U http://ieeexplore.ieee.org/document/7445361/
%X In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We draw possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. The results of this study are aimed at helping the system administrators minimize (or even prevent) the destructive effects of correlated node failures.
%@ 978-1-4673-8776-7
@inproceedings{ghiasvand2016lessons,
abstract = {In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We draw possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. The results of this study are aimed at helping the system administrators minimize (or even prevent) the destructive effects of correlated node failures.},
added-at = {2024-12-10T16:28:05.000+0100},
address = {Heraklion, Crete, Greece},
author = {Ghiasvand, Siavash and Ciorba, Florina M. and Tschuter, Ronny and Nagel, Wolfgang E.},
biburl = {https://puma.scadsai.uni-leipzig.de/bibtex/2c689828caf45340c6ed5933c6a1ca468/scads.ai},
booktitle = {Proceedings of the 24th {Euromicro} {International} {Conference} on {Parallel}, {Distributed}, and {Network}-{Based} {Processing} ({PDP})},
copyright = {All rights reserved},
doi = {10.1109/PDP.2016.101},
file = {Ghiasvand et al. - 2016 - Lessons Learned from Spatial and Temporal Correlat.pdf:D\:\\Documents\\Zotero\\storage\\4UL3UN97\\Ghiasvand et al. - 2016 - Lessons Learned from Spatial and Temporal Correlat.pdf:application/pdf},
interhash = {14f51cdd8f497d3fa493c4f43c90296b},
intrahash = {c689828caf45340c6ed5933c6a1ca468},
isbn = {978-1-4673-8776-7},
keywords = {myOwn from:ghiasvan},
language = {en},
month = feb,
pages = {377--381},
publisher = {IEEE},
timestamp = {2024-12-10T16:28:05.000+0100},
title = {Lessons {Learned} from {Spatial} and {Temporal} {Correlation} of {Node} {Failures} in {High} {Performance} {Computers}},
url = {http://ieeexplore.ieee.org/document/7445361/},
urldate = {2018-12-04},
year = 2016
}