Energy-efficient training of multiple deep learning models on GPU clusters
Deep learning is the state-of-the-art technology behind many AI applications, but training complex deep learning models on powerful GPU clusters is both time- and energy-consuming. This project aims to design energy-efficient resource allocation and task scheduling (RATS) solutions for a GPU cluster that runs a set of deep learning training jobs. First, we will build an open data set of the performance and power usage of a broad set of GPU kernels under different dynamic voltage and frequency scaling (DVFS) configurations. Second, we will develop quantitative performance and power models for training deep models on multiple GPUs that account for the effect of DVFS. Third, we will tackle the RATS problem, in which training jobs arrive over time and each job is modeled by a directed acyclic graph (DAG) that contains a set of computing and communication tasks.
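
As an illustration of the DAG job model, the minimal Python sketch below represents one training job as a set of computing and communication tasks with precedence constraints, and computes the length of its critical path, which is the earliest time the job could finish if resources were unlimited. The task names, durations, and the helper critical_path_length are hypothetical and chosen only for illustration; they are not the project's models or measurements.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: task name -> (task kind, duration in seconds).
# All values are illustrative assumptions, not measured data.
tasks = {
    "fwd":   ("compute", 4.0),        # forward pass
    "bwd2":  ("compute", 3.0),        # backward pass, later layers
    "bwd1":  ("compute", 3.0),        # backward pass, earlier layers
    "comm2": ("communication", 2.0),  # gradient exchange for later layers
    "comm1": ("communication", 2.0),  # gradient exchange for earlier layers
    "update": ("compute", 1.0),       # parameter update
}

# Precedence constraints: task name -> tasks that must finish first.
preds = {
    "fwd": set(),
    "bwd2": {"fwd"},
    "bwd1": {"bwd2"},
    "comm2": {"bwd2"},   # can overlap with bwd1
    "comm1": {"bwd1"},
    "update": {"comm1", "comm2"},
}

def critical_path_length(tasks, preds):
    """Return the earliest completion time of the DAG with unlimited resources."""
    finish = {}
    # static_order() yields every task after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[name]), default=0.0)
        finish[name] = start + tasks[name][1]
    return max(finish.values())

print(critical_path_length(tasks, preds))
```

In this toy job the communication task comm2 overlaps with the computing task bwd1, so the critical path is shorter than the sum of all task durations. A scheduler that also chooses DVFS settings would treat each duration, and the power drawn during it, as a function of the selected GPU frequency rather than as a constant.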