Efficient Training and Deployment of Machine Learning on Distributed and Parallel Systems

We study how to improve the performance and energy efficiency of machine learning workloads (both training and inference), how to make machine learning approachable to regular users (personalized machine learning), and how to make machine learning robust and secure.


  • Performance optimization and modeling for machine learning training

Representative publications:

  • [IPDPS’19] Jiawen Liu, Dong Li, Gokcen Kestor, and Jeffrey Vetter. Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training. In the 33rd IEEE International Parallel and Distributed Processing Symposium.
  • [MICRO’18] Jiawen Liu, Hengyu Zhao, Matheus Ogleari, Dong Li, and Jishen Zhao. Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach. In the 51st IEEE/ACM International Symposium on Microarchitecture.


Resource Management in Heterogeneous Computing

Heterogeneous computing uses more than one kind of processor and memory to achieve the best performance and energy efficiency. GPU-based systems are one example of heterogeneous computing; emerging non-volatile memory-based systems are another. Our research on resource management in heterogeneous computing includes data movement and placement on emerging architectures, and management of massive thread-level parallelism on GPUs.
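To give a flavor of the data-placement problem, the sketch below shows a simple access-count-based policy that keeps "hot" pages in fast memory (e.g., DRAM) and demotes "cold" pages to slow memory (e.g., NVM). This is a hypothetical illustration only; the function name, threshold scheme, and page identifiers are our own and do not come from the publications above.

```python
# Illustrative sketch of hot/cold data placement between a small fast
# memory (e.g., DRAM) and a large slow memory (e.g., NVM).
# All names and parameters here are hypothetical.

def place_pages(access_counts, fast_capacity):
    """Rank pages by access count; the hottest pages, up to
    fast_capacity, go to fast memory, the rest to slow memory."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    fast = set(ranked[:fast_capacity])
    slow = set(ranked[fast_capacity:])
    return fast, slow

# Example: four pages, room for only two in fast memory.
counts = {"p0": 90, "p1": 5, "p2": 40, "p3": 2}
fast, slow = place_pages(counts, fast_capacity=2)
# The two most frequently accessed pages (p0, p2) land in fast memory.
```

Real systems refine this basic idea with online profiling, migration cost models, and awareness of asymmetric read/write latencies, but the ranking-and-capacity structure is the common core.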



Fault Tolerance in Large-Scale Parallel Systems

We study how to make large-scale parallel systems survive potential hardware failures and software errors without sacrificing system efficiency and effectiveness.
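A classic building block in this space is checkpoint/restart: the application periodically saves its state so that, after a failure, it resumes from the last checkpoint instead of restarting from scratch. The minimal sketch below illustrates the idea; the file name, checkpoint interval, and "work" performed are hypothetical and not tied to any specific system from our research.

```python
# Minimal checkpoint/restart sketch (hypothetical example).
# State is periodically saved to disk; on startup, the computation
# resumes from the last saved checkpoint if one exists.
import os
import pickle

CKPT = "state.ckpt"  # hypothetical checkpoint file name

def run(total_steps, ckpt_every=10):
    # Resume from the last checkpoint if one exists.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            step, acc = pickle.load(f)
    else:
        step, acc = 0, 0
    while step < total_steps:
        acc += step          # stand-in for one unit of real work
        step += 1
        if step % ckpt_every == 0:
            with open(CKPT, "wb") as f:
                pickle.dump((step, acc), f)
    return acc

result = run(100)  # sums 0..99; a crash mid-run would lose at most
                   # ckpt_every steps of work on the next invocation
```

The research challenge is choosing checkpoint frequency and placement (local storage, remote nodes, memory tiers) so that the overhead of saving state does not erase the efficiency gains of surviving failures.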