Qingqing Cao, Thesis Preliminary Announcement, 'Efficient and Practical Neural Question Answering for Heterogeneous Platforms'

Friday, October 9, 2020 - 1:00pm to 2:50pm
Event Description: 


Title: Efficient and Practical Neural Question Answering for Heterogeneous Platforms

Abstract: Question answering (QA) systems power many real-world applications, ranging from intelligent personal assistants (like Alexa, Siri, and Google Assistant) to commercial search engines such as Google and Bing. Deep learning techniques have made QA systems more effective, yet they are expensive and impractical in many cases. It is challenging to deploy neural QA systems because they are compute-intensive and cannot run on mobile devices. This thesis seeks to make neural QA systems more efficient, usable, and accessible by adapting well-known systems and memory optimizations to different question answering pipelines and heterogeneous hardware. The key design goals in this thesis are to develop generic optimizations that (1) are broadly applicable to different QA models; (2) require minimal adaptation effort and no repeated retraining costs; and (3) run these QA models efficiently on heterogeneous hardware. This thesis presents two completed projects along with two proposed works to achieve these goals.

First, we present DeQA, which provides device-wide question answering capability to help mobile users find information across multiple applications on their phones more efficiently. Our measurement study shows that running end-to-end deep learning-based QA models on mobile devices is slow and unusable. We design a set of latency and memory optimizations that are widely applicable to state-of-the-art QA systems so that they can run locally on mobile devices. The experiments show that DeQA can effectively reduce the memory footprint and speed up QA inference by over 16x on the phone, with a minimal drop in QA accuracy.

Second, we present DeFormer, a simple decomposition-based technique that takes pre-trained Transformer-based models and modifies them to enable faster inference for QA without having to repeat the pre-training. This is a critical requirement if we want to explore accuracy versus speed trade-offs, because pre-training is expensive. The core idea of DeFormer is to decompose the lower layers of pre-trained Transformer models so that they process the question and context text independently, while the higher layers process them jointly. The evaluation shows that DeFormer achieves a substantial speedup (2.7 to 4.3x) and reduction in memory (65.8% to 72.9%) for only a small loss in accuracy (0.6 to 1.8 F1) for QA. More importantly, on the mobile phone, DeFormer can reduce the latency from 10s to 3s, bringing the QA system closer to the usable range.

We propose two works to support this thesis further. Emerging deep learning hardware like mobile accelerators comes with high energy efficiency and fast inference speed to provide more powerful on-device intelligence. We propose to work on the AccVQA project, which aims to accelerate on-device visual question answering using mobile accelerators. In addition, energy consumption is an essential metric for developing efficient NLP models. We propose to work on robust and accurate energy modeling of NLP models.

Contact events [at] cs.stonybrook.edu for Zoom info.
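For readers unfamiliar with the decomposition idea described for DeFormer in the abstract, the following is a rough, illustrative sketch only, not the thesis implementation. It assumes a toy parameter-free self-attention layer in place of real Transformer layers, and the names `toy_attention`, `encode`, `deformer_forward`, and `cached_context` are hypothetical. It shows why processing the question and context separately in the lower layers matters: the lower-layer context representation does not depend on the question, so it can be computed once and reused.

```python
import math

def toy_attention(tokens):
    """Toy self-attention (assumption: no learned weights, single head).
    Each token vector becomes a softmax-weighted average of all tokens."""
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[i] for w, v in zip(weights, tokens)) / total
            for i in range(len(q))
        ])
    return out

def encode(tokens, num_layers):
    """Apply the toy layer num_layers times."""
    for _ in range(num_layers):
        tokens = toy_attention(tokens)
    return tokens

def deformer_forward(question, context, lower_layers, upper_layers,
                     cached_context=None):
    """Lower layers see question and context separately; upper layers see
    the concatenation. Returns the joint output and the lower-layer
    context representation, which can be passed back as cached_context."""
    q_lower = encode(question, lower_layers)
    if cached_context is None:
        cached_context = encode(context, lower_layers)
    joint = encode(q_lower + cached_context, upper_layers)
    return joint, cached_context

# Hypothetical 2-dimensional token vectors, for illustration only.
context = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
first_question = [[1.0, 0.0]]
second_question = [[0.0, 1.0], [0.5, 0.5]]

_, context_cache = deformer_forward(first_question, context, 2, 1)
# The cached context representation is reused for a different question.
answer_repr, _ = deformer_forward(second_question, context, 2, 1,
                                  cached_context=context_cache)
```

In this sketch the second call skips the lower layers for the context entirely, which is the kind of saving the decomposition is meant to enable.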
