Fault Tolerance in Large-Scale Parallel Systems

We study how to make large-scale parallel systems survive hardware failures and software errors without sacrificing efficiency or effectiveness.

Project:

Representative publications:

  • Yingchao Huang, Kai Wu, and Dong Li. High Performance Data Persistence in Non-Volatile Memory for Resilient High Performance Computing (arXiv link).
  • Luanzheng Guo, Hanlin He, and Dong Li. Application-Level Resilience Modeling for HPC Fault Tolerance (arXiv link).
  • [SC’14] Li Yu, Dong Li, Sparsh Mittal, and Jeffrey S. Vetter. “Quantitatively Modeling Application Resiliency with the Data Vulnerability Factor”. In 26th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2014 (acceptance rate: 21%). (nominated for best student paper)
  • [SC’13] Dong Li, Zizhong Chen, Panruo Wu, Jeffrey S. Vetter. “Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach”. In 25th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2013 (acceptance rate: 20%)
  • [SC’12] Dong Li, Jeffrey S. Vetter, and Weikuan Yu. “Classifying Soft Error Vulnerabilities in Extreme-Scale Scientific Applications Using a Binary Instrumentation Tool”. In 24th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (acceptance rate: 21%)

 

Resource Management in Heterogeneous Computing

Heterogeneous computing uses more than one kind of processor or memory to achieve the best performance and energy efficiency. GPU-based systems are one example; emerging systems built on non-volatile memory are another. Our research on resource management in heterogeneous computing covers data movement and placement on emerging architectures, and the management of massive thread-level parallelism on GPUs.

Projects:

Representative publications:

  • [SC’17] Kai Wu, Yingchao Huang, and Dong Li. “Unimem: Runtime Data Management on Non-Volatile Memory-based Heterogeneous Main Memory”. In 29th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (acceptance rate: 18.7%)
  • [HPDC’16] Panruo Wu, Dong Li, Zizhong Chen, and Jeffrey Vetter. “Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory”. In 25th ACM International Symposium on High Performance Parallel and Distributed Computing (acceptance rate: 16%)
  • [MICRO’14] Guoyang Chen, Bo Wu, Dong Li, and Xipeng Shen. “PORPLE: An Extensible Optimizer for Portable Data Placement”. In 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014 (acceptance rate: 19%)
  • [PACT’13] Bin Wang, Bo Wu, Dong Li, Xipeng Shen, Weikuan Yu, Yizheng Jiao, and Jeffrey S. Vetter. “Exploring Hybrid Memory for GPU Energy Efficiency through Software-Hardware Co-Design”. In 22nd ACM/IEEE International Conference on Parallel Architectures and Compilation Techniques, 2013 (acceptance rate: 17%)

 

Performance Optimization and Modeling for Parallel Systems

“Performance” here includes both execution time and power/energy efficiency. We study emerging commercial workloads (e.g., machine learning workloads in a distributed deployment) and scientific applications (e.g., molecular dynamics and combustion simulations).

Projects:

  • Performance optimization and modeling for deep neural networks
  • Power-aware MPI and OpenMP

Representative publications:

  • [HPDC’14] Sparsh Mittal, Jeffrey S. Vetter, and Dong Li. “Improving Energy Efficiency of Embedded DRAM Caches for High-end Computing Systems”. In 23rd ACM International Symposium on High Performance Parallel and Distributed Computing, 2014 (acceptance rate: 16%)
  • [TPDS’13] Dong Li, Bronis R. de Supinski, Martin Schulz, Dimitrios S. Nikolopoulos, and Kirk W. Cameron. “Strategies for Energy Efficient Resource Management of Hybrid Programming Models”. IEEE Transactions on Parallel and Distributed Systems, 24 (1), pages 144-157, 2013.
  • [IISWC’12] Chun-Yi Su, Dong Li, Dimitrios S. Nikolopoulos, Kirk W. Cameron, Bronis R. de Supinski, and Edgar A. Leon. “Model-Based, Memory-Centric Performance and Power Optimization on NUMA Multiprocessors”. In 7th IEEE International Symposium on Workload Characterization, 2012
  • [IPDPS’10] Dong Li, Bronis R. de Supinski, Martin Schulz, Kirk W. Cameron, and Dimitrios S. Nikolopoulos. “Hybrid MPI/OpenMP Power-Aware Computing”. In Proceedings of 24th IEEE International Parallel and Distributed Processing Symposium, 2010 (acceptance rate: 24%)