2012 - National Science Foundation Grant


STONY BROOK, NY, July 10, 2012 

Vision and language provide fundamental means to interpret, learn about, and communicate about the world around us. A primary goal of computer vision and natural language processing research is therefore to automatically uncover and analyze the information that images and video, or text and speech, convey about the world. Both communities are concerned with tasks that require increasingly deep understanding, including the ability to reason with and draw inferences from this information. Since vision and language are complementary modalities, there is also a growing body of work at the interface of the two fields. However, progress in multimodal analysis requires closer collaboration between the two communities, since each currently relies on its own set of techniques, datasets, and evaluation criteria.

This community planning grant explores the need for, and the feasibility and usefulness of, a "visual entailment" corpus and an associated visual entailment recognition task. In natural language processing, entailment recognition is the problem of determining whether a particular statement can be inferred from a text document. This project explores a novel related problem, visual entailment, where the goal is to determine whether a natural language statement can be inferred from an image or video. The outcomes of the project include a novel dataset and a prototype research challenge, as well as increased collaboration between the vision and language communities.