Dates
Tuesday, November 15, 2022 - 01:00pm to Tuesday, November 15, 2022 - 03:00pm
Location
NCS 220
Event Description


Abstract: Training modern Deep Neural Networks (DNNs) is a highly resource-intensive task. It is crucial to use expensive resources such as GPUs and SSD storage as efficiently as possible. Given the high variability in the resource requirements of training jobs and the variety of available hardware configurations, inefficiencies can arise from many sources.
We design multiple tools to improve GPU utilization for training in various scenarios. Selecting the GPU configuration that minimizes training time and/or cost is not obvious. We design Ceer, a regression-based estimator that selects the type and number of GPUs needed to meet given training-time and/or cost requirements. Ceer assumes that GPU memory is adequate to host the DNN model. However, modern DNNs often have a memory footprint so large that the model must be split across multiple GPUs, a practice referred to as model parallelism. To improve GPU utilization for these large models, we develop Pesto, which formulates model splitting as an integer program: Pesto minimizes inter-GPU communication while maximizing the opportunity to parallelize model execution across multiple GPUs. Despite the optimizations applied by Ceer and Pesto, DNN training may still only partially utilize GPU resources. A natural approach to leveraging the unused resources is to colocate another job with the DNN training. However, uncontrolled sharing leads to unpredictable performance and SLO violations for latency-sensitive jobs. To avoid SLO violations in shared deployments, we design Herald, a prediction-based system that increases GPU efficiency by enabling controlled, fine-grained spatial and temporal sharing of GPUs among multiple training and inference jobs.
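
To give a flavor of the integer-programming formulation, here is a minimal sketch (not Pesto's actual model; the operators, memory sizes, edge weights, memory cap, and the PuLP solver are all illustrative assumptions) that assigns the operators of a small model graph to two GPUs so that cross-GPU communication is minimized under a per-GPU memory budget:

```python
# Sketch: model splitting as an integer program (illustrative, not Pesto itself).
import pulp

ops = ["embed", "attn", "mlp", "head"]                 # hypothetical operators
mem = {"embed": 6, "attn": 4, "mlp": 4, "head": 2}     # GB per operator (made up)
edges = {("embed", "attn"): 3, ("attn", "mlp"): 2, ("mlp", "head"): 1}  # GB moved per step
gpus = [0, 1]
MEM_CAP = 9  # assumed per-GPU memory budget in GB

prob = pulp.LpProblem("model_split", pulp.LpMinimize)

# x[o, g] = 1 if operator o is placed on GPU g
x = pulp.LpVariable.dicts("x", [(o, g) for o in ops for g in gpus], cat="Binary")
# cut[e] = 1 if the two endpoints of edge e land on different GPUs
cut = pulp.LpVariable.dicts("cut", list(edges), cat="Binary")

# Objective: minimize total inter-GPU traffic
prob += pulp.lpSum(edges[e] * cut[e] for e in edges)

for o in ops:                                   # each operator on exactly one GPU
    prob += pulp.lpSum(x[o, g] for g in gpus) == 1
for g in gpus:                                  # respect the GPU memory budget
    prob += pulp.lpSum(mem[o] * x[o, g] for o in ops) <= MEM_CAP
for (u, v) in edges:                            # cut[e] must cover placement disagreement
    for g in gpus:
        prob += cut[(u, v)] >= x[u, g] - x[v, g]
        prob += cut[(u, v)] >= x[v, g] - x[u, g]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for o in ops:
    placed = [g for g in gpus if pulp.value(x[o, g]) > 0.5]
    print(o, "-> GPU", placed[0])
```

With the made-up sizes above, the solver pairs operators so that only the cheapest edges cross the GPU boundary; the real system would of course also account for parallel execution opportunities across the placed partitions.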

In addition to GPUs, DNN training jobs also require significant storage resources, especially for data pre-processing. The shift to cloud computing requires optimizing across all data-processing pipelines running concurrently on a cluster. We study one specific instance of this problem: the placement of I/O-intensive temporary intermediate data on SSDs and HDDs. We analyze production logs from Google's data centers for a range of data-processing pipelines. Our analysis shows that learning-based strategies can extract features predictive of the IOPS of temporary files produced by various transformations, and that these predictions can be used to improve the efficiency of storage devices in data-processing pipelines.
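
As a rough illustration of such a learning-based strategy (a toy sketch on synthetic data, not the production pipeline; the feature names, the gradient-boosted regressor, and the SSD_IOPS_THRESHOLD cutoff are all assumptions), one could fit a regressor on per-transformation features and use the predicted IOPS to decide whether a temporary file belongs on SSD or HDD:

```python
# Sketch: predict temporary-file IOPS from transformation features, then pick a device.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic features per temporary file: [input_gb, parallelism, is_shuffle] (hypothetical)
X = rng.uniform([1, 1, 0], [500, 64, 1], size=(1000, 3))
# Synthetic target: IOPS grows with parallelism and shuffle-like transformations
y = 50 * X[:, 1] + 2000 * (X[:, 2] > 0.5) + rng.normal(0, 100, 1000)

model = GradientBoostingRegressor().fit(X[:800], y[:800])

SSD_IOPS_THRESHOLD = 1500  # assumed cutoff above which SSD placement pays off
for feats in X[800:805]:
    predicted_iops = model.predict(feats.reshape(1, -1))[0]
    device = "SSD" if predicted_iops > SSD_IOPS_THRESHOLD else "HDD"
    print(f"predicted IOPS={predicted_iops:7.1f} -> place on {device}")
```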

Event Title
PhD Thesis Defense: Ubaid Ullah Hafeez, 'Towards Efficient and Performant Distributed Machine Learning Systems'