Resource Management in Heterogeneous Computing

Heterogeneous computing uses more than one kind of processor and memory to achieve the best performance and energy efficiency; emerging non-volatile memory-based systems are one example. Our research on resource management in heterogeneous computing includes data movement and placement in emerging memory architectures, and management of massive thread-level parallelism on GPUs.
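To make the data-placement problem concrete, here is a minimal illustrative sketch (the class and policy are hypothetical, not a real system API): a simple access-count-based policy that keeps the hottest pages in a small fast-memory tier and demotes cold pages to the slow tier, the kind of decision a heterogeneous-memory runtime must make.

```python
# Hypothetical sketch of hot-page placement in a two-tier memory system.
# Pages with the highest access counts are "promoted" to the fast tier
# (e.g., DRAM) and the rest stay in the slow tier (e.g., NVM).

from collections import Counter

class TieredMemory:
    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity   # number of pages that fit in fast memory
        self.fast = set()                    # pages currently in the fast tier
        self.access_counts = Counter()       # per-page access counters

    def record_access(self, page):
        self.access_counts[page] += 1

    def rebalance(self):
        # Keep the most frequently accessed pages in the fast tier.
        hottest = {p for p, _ in self.access_counts.most_common(self.fast_capacity)}
        promoted = hottest - self.fast       # pages moved into fast memory
        demoted = self.fast - hottest        # pages moved out to slow memory
        self.fast = hottest
        return promoted, demoted

# Usage: simulate a skewed access pattern; pages 1 and 2 become hot.
mem = TieredMemory(fast_capacity=2)
for page in [1, 1, 1, 2, 2, 3]:
    mem.record_access(page)
promoted, demoted = mem.rebalance()
```

Real systems must also account for migration cost and bandwidth asymmetry between tiers, which this counting sketch ignores.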



Efficient Training and Deployment of Machine Learning on Distributed and Parallel Systems

We study how to improve the performance and energy efficiency of machine learning workloads (both training and inference), and how to make machine learning approachable for regular users (personalized machine learning).


  • Performance optimization and modeling for large-scale machine learning training
  • Learning on resource-constrained devices (e.g., mobile phones)

Representative publications:

  • [IPDPS’19] Jiawen Liu, Dong Li, Gokcen Kestor, and Jeffrey Vetter. Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training. In 33rd IEEE International Parallel and Distributed Processing Symposium.
  • [ICPADS’19] Jie Liu, Jiawen Liu, Zhen Xie, and Dong Li. Performance Analysis and Characterization of Training Deep Learning Models on Mobile Devices. In 25th IEEE International Conference on Parallel and Distributed Systems.
  • [MICRO’18] Jiawen Liu, Hengyu Zhao, Matheus Ogleari, Dong Li, and Jishen Zhao. Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach. In 51st IEEE/ACM International Symposium on Microarchitecture.


Scientific Machine Learning

We study how to enhance the usability of machine learning models in scientific HPC applications.

Representative publications:

  • [SC’19] Wenqian Dong, Jie Liu, Zhen Xie, Dong Li. Adaptive Neural Network-Based Approximation to Accelerate Eulerian Fluid Simulation. In 31st ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
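The core idea behind surrogate-based acceleration can be illustrated with a minimal sketch (all names here are hypothetical, and a least-squares polynomial fit stands in for the neural network used in the actual work): a cheap learned model approximates an expensive solver step, and the computation falls back to the exact solver whenever the approximation is not accurate enough.

```python
# Illustrative sketch of adaptive surrogate approximation: replace an
# expensive solver step with a cheap fitted model when the model is
# accurate, otherwise fall back to the exact solver.

import numpy as np

def expensive_solver(x):
    return np.sin(x)  # stand-in for a costly simulation step

# "Train" the surrogate on sampled solver outputs.
xs = np.linspace(0, 1, 50)
ys = expensive_solver(xs)
coeffs = np.polyfit(xs, ys, deg=3)       # cubic fit as the cheap model

def adaptive_step(x, tol=1e-3):
    approx = np.polyval(coeffs, x)
    # In a real system the error would be estimated (e.g., by the model
    # itself); here we check against the solver directly for clarity.
    if abs(approx - expensive_solver(x)) < tol:
        return approx, "surrogate"
    return expensive_solver(x), "solver"

value, source = adaptive_step(0.5)
```

By construction, `adaptive_step` always returns a value within `tol` of the exact solver, so the approximation never silently degrades the simulation beyond the chosen tolerance.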


Fault Tolerance in Large-Scale Parallel Systems

We study how to make large-scale parallel systems survive potential hardware failures and software errors without sacrificing system efficiency or effectiveness.
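One common building block in this space is application-level checkpoint/restart. The following is a minimal sketch under simplified assumptions (in-memory "checkpoints", a single simulated failure; all names are hypothetical): the computation periodically saves its state, and after a failure it rolls back to the last checkpoint instead of starting over.

```python
# Illustrative sketch of checkpoint/restart: periodically snapshot the
# application state so that a failure only costs the work done since the
# last checkpoint, not the whole run.

import copy

def run(total_steps, checkpoint_every, fail_at=None):
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)
    step = 0
    while step < total_steps:
        try:
            if fail_at is not None and step == fail_at:
                fail_at = None                       # fail only once
                raise RuntimeError("simulated node failure")
            state["value"] += step                   # one unit of real work
            step += 1
            state["step"] = step
            if step % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)    # persist a snapshot
        except RuntimeError:
            state = copy.deepcopy(checkpoint)        # roll back to snapshot
            step = state["step"]                     # resume from there
    return state

# A run that fails at step 7 produces the same result as a failure-free
# run, at the cost of recomputing the steps since the last checkpoint.
result_with_failure = run(10, checkpoint_every=3, fail_at=7)
result_clean = run(10, checkpoint_every=3)
```

The efficiency question our research addresses is the tension this sketch exposes: frequent checkpoints shrink the recomputation window but add overhead on every iteration, while infrequent ones do the opposite.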