Analysis of Node Failures in High Performance Computers Based on System Logs

S. Ghiasvand, F. Ciorba, R. Tschuter, and W. Nagel. International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Austin, Texas, USA, (November 2015)

Abstract

The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, it is expected that the mean time between failures of HPC systems will become too short and that current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is thus essential to prevent their destructive effects. Based on measurements of a production system at TU Dresden over an 8-month period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study lies in achieving a clearer understanding of correlations between observed node failures and in enabling failure detection as early as possible. The results aim to help system administrators minimize (or prevent) the destructive effects of failures.

Links and resources

BibTeX key: ghiasvand2015analysis
entry type: inproceedings
address: Austin, Texas, USA
booktitle: International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
year: 2015
month: nov
file: Ghiasvand and Ciorba - Analysis of Node Failures in High Performance Comp.pdf:D\:\\Documents\\Zotero\\storage\\2DAJT265\\Ghiasvand and Ciorba - Analysis of Node Failures in High Performance Comp.pdf:application/pdf;Ghiasvand et al. - Analysis of Node Failures in High Performance Comp.pdf:D\:\\Documents\\Zotero\\storage\\93H9QMDS\\Ghiasvand et al. - Analysis of Node Failures in High Performance Comp.pdf:application/pdf
language: en
Document: http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/tech_poster_pages/post338.html

Cite this publication

@inproceedings{ghiasvand2015analysis,
  abstract = {The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. In the near future, it is expected that the mean time between failures of HPC systems becomes too short and that current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is, thus, essential to prevent their destructive effects. Based on measurements of a production system at TU Dresden over an 8-month time period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study is achieving a clearer understanding of correlations between observed node failures and enabling failure detection as early as possible. The results aimed to help system administrators minimize (or prevent) the destructive effects of failures.},
  added-at = {2024-12-10T16:17:47.000+0100},
  address = {Austin, Texas, USA},
  author = {Ghiasvand, Siavash and Ciorba, Florina M and Tschuter, Ronny and Nagel, Wolfgang E.},
  biburl = {https://puma.scadsai.uni-leipzig.de/bibtex/2bf6f119756a84b3d76bf365f338833e7/ghiasvan},
  booktitle = {International {Conference} for {High} {Performance} {Computing}, {Networking}, {Storage} and {Analysis} ({SC})},
  file = {Ghiasvand and Ciorba - Analysis of Node Failures in High Performance Comp.pdf:D\:\\Documents\\Zotero\\storage\\2DAJT265\\Ghiasvand and Ciorba - Analysis of Node Failures in High Performance Comp.pdf:application/pdf;Ghiasvand et al. - Analysis of Node Failures in High Performance Comp.pdf:D\:\\Documents\\Zotero\\storage\\93H9QMDS\\Ghiasvand et al. - Analysis of Node Failures in High Performance Comp.pdf:application/pdf},
  interhash = {3ce5a269b291227cd38165e59aa0d9b5},
  intrahash = {bf6f119756a84b3d76bf365f338833e7},
  keywords = {myOwn},
  language = {en},
  month = nov,
  timestamp = {2024-12-10T16:27:28.000+0100},
  title = {Analysis of {Node} {Failures} in {High} {Performance} {Computers} {Based} on {System} {Logs}},
  url = {http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/tech_poster_pages/post338.html},
  year = 2015
}
