PhD Defense: Ibrahim Akgun, Using Machine Learning to Improve Operating Systems' I/O Subsystems

Friday, November 18, 2022 - 11:00am to 1:00pm
NCS 115
Event Description: 

Despite the ever-changing nature of computing systems, operating systems and storage systems still follow architectures, algorithms, and structures designed decades ago.  Modern software stacks generate complex, dynamic workloads that run on statically configured storage stacks.  To provide the best performance for such varied workloads, we need self-adaptive, dynamically configured storage systems.  However, the current design principles of storage and operating systems offer no built-in support for self-adaptability.

One way to provide the self-adaptability needed in storage and operating systems is to approach operating system problems with the help of machine learning.  Researchers have tried using machine learning to solve operating system problems; however, existing solutions are either not practical or not versatile enough.  Therefore, we propose a complete pipeline for building machine learning models that improve operating system components, especially I/O subsystems and their performance.  First, we provide a low-overhead, high-fidelity data-collection framework to trace and collect data from inside operating systems.  We then develop a lightweight and efficient machine learning (ML) framework that can run at the kernel level and tune kernel parameters to improve I/O performance.

We have applied our machine learning framework, called KML, to tune disk readahead sizes according to workload-type predictions.  We used RocksDB as our benchmarking platform.  KML improved I/O performance for RocksDB's benchmark workloads, including realistic ones (e.g., Facebook's mixgraph), by up to 2.3x.  We also include a second storage use case: tuning the NFS rsize parameter.  We observed performance improvements of as much as 15x for the NFS rsize use case.  In addition to these storage use cases, we applied KML to the network stack to improve bandwidth-sharing fairness for BBR, a new TCP congestion control algorithm.  The BBR-ML model improved bandwidth-sharing fairness by as much as 30% when a CUBIC flow and a BBR flow ran concurrently.
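
As a rough illustration of the readahead use case described above, the sketch below classifies a workload from two features and looks up a readahead size for the predicted type.  It is not KML's model or API: the feature set, the nearest-centroid classifier, the centroid values, the readahead sizes, and all function names are assumptions made up for this example.

```python
from math import dist

# Hypothetical per-workload centroids over two illustrative features:
# (fraction of sequential reads, mean request size in KiB).
# These values are invented for the example, not measured.
CENTROIDS = {
    "sequential_read": (0.9, 128.0),
    "random_read": (0.1, 4.0),
    "mixed": (0.5, 32.0),
}

# Hypothetical readahead sizes (KiB) for each predicted workload type.
READAHEAD_KB = {
    "sequential_read": 1024,
    "random_read": 8,
    "mixed": 128,
}


def predict_workload(features):
    """Return the label of the nearest centroid.

    A stand-in for a trained workload-type classifier.
    """
    return min(CENTROIDS, key=lambda label: dist(features, CENTROIDS[label]))


def choose_readahead_kb(features):
    """Map the predicted workload type to a readahead size in KiB."""
    return READAHEAD_KB[predict_workload(features)]


if __name__ == "__main__":
    # A mostly sequential workload with large requests.
    print(choose_readahead_kb((0.85, 120.0)))
```

On Linux, a value chosen this way could be applied to a block device through its `read_ahead_kb` sysfs attribute; the sketch only computes the value and does not touch the system.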

It is our thesis that operating systems contain many heuristics built largely by hand over many years, and yet they cannot easily adapt to changing environments and workload conditions; therefore, we believe that compact and efficient machine learning engines should become first-class citizens inside operating systems and be used to improve I/O subsystems.
