Nathan DeBardeleben
Addressing failures in exascale computing
M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ...
The International Journal of High Performance Computing Applications 28 (2 …, 2014
Memory errors in modern systems: The good, the bad, and the ugly
V Sridharan, N DeBardeleben, S Blanchard, KB Ferreira, J Stearley, ...
ACM SIGARCH Computer Architecture News 43 (1), 297-310, 2015
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults
V Sridharan, J Stearley, N DeBardeleben, S Blanchard, S Gurumurthi
SC'13: Proceedings of the International Conference on High Performance …, 2013
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
D Tiwari, S Gupta, J Rogers, D Maxwell, P Rech, S Vazhkudai, D Oliveira, ...
2015 IEEE 21st International Symposium on High Performance Computer …, 2015
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod
Whitepaper, Dec, 2009
GPGPUs: How to Combine High Computational Power with High Reliability
LB Gomez, F Cappello, L Carro, N DeBardeleben, B Fang, S Gurumurthi, ...
F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
Q Guan, N DeBardeleben, S Blanchard, S Fu
Proceedings of the 2014 IEEE 28th International Parallel and Distributed …, 2014
On the diversity of cluster workloads and its impact on research results
G Amvrosiadis, JW Park, GR Ganger, GA Gibson, E Baseman, ...
2018 USENIX Annual Technical Conference (USENIX ATC 18), 533-546, 2018
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters
WM Jones, JT Daly, N DeBardeleben
Proceedings of the 19th ACM International Symposium on High Performance …, 2010
Application monitoring and checkpointing in HPC: looking towards exascale systems
WM Jones, JT Daly, N DeBardeleben
Proceedings of the 50th Annual Southeast Regional Conference, 262-267, 2012
Inter-agency workshop on HPC resilience at extreme scale
J Daly, B Harrod, T Hoang, L Nowell, B Adolf, S Borkar, N DeBardeleben, ...
National Security Agency Advanced Computing Systems, 2012
Developing scientific applications using Eclipse
GR Watson, NA DeBardeleben
Computing in Science & Engineering 8 (4), 50-61, 2006
Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience
N DeBardeleben, S Blanchard, Q Guan, Z Zhang, S Fu
European Conference on Parallel Processing, 282-291, 2011
GPU behavior on a large HPC cluster
N DeBardeleben, S Blanchard, L Monroe, P Romero, D Grunau, C Idler, ...
European Conference on Parallel Processing, 680-689, 2013
Towards practical algorithm based fault tolerance in dense linear algebra
P Wu, Q Guan, N DeBardeleben, S Blanchard, D Tao, X Liang, J Chen, ...
Proceedings of the 25th ACM International Symposium on High-Performance …, 2016
Experimental and analytical study of Xeon Phi reliability
D Oliveira, L Pilla, N DeBardeleben, S Blanchard, H Quinn, I Koren, ...
Proceedings of the International Conference for High Performance Computing …, 2017
Silent data corruption resilient two-sided matrix factorizations
P Wu, N DeBardeleben, Q Guan, S Blanchard, J Chen, D Tao, X Liang, ...
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of …, 2017
Interpretable anomaly detection for monitoring of high performance computing systems
E Baseman, S Blanchard, N DeBardeleben, A Bonnie, A Morrow
Outlier Definition, Detection, and Description on Demand Workshop at ACM …, 2016
Exploring time and frequency domains for accurate and automated anomaly detection in cloud computing systems
Q Guan, S Fu, N DeBardeleben, S Blanchard
2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing …, 2013
An investigation of the effects of hard and soft errors on graphics processing unit‐accelerated molecular dynamics simulations
RM Betz, NA DeBardeleben, RC Walker
Concurrency and Computation: Practice and Experience 26 (13), 2134-2140, 2014