How Humans (and Machines) Integrate Language and Vision: Image Description as a Test Case

Frank Keller - School of Informatics - University of Edinburgh

Joint work with Moreno Coco and Des Elliott

When humans process text or speech, this often happens in a visual context, e.g., when listening to a lecture, reading a map, or describing an image. Here, we focus on image description as an example of language/vision integration. Previous research has shown that objects in a visual scene are fixated before they are mentioned, leading us to hypothesize that the scan pattern of a participant can be used to predict what they will say. We test this hypothesis using a data set of cued scene descriptions of photo-realistic scenes. We demonstrate that similar scan patterns are correlated with similar sentences and that this correlation holds for three phases of language production (target identification, sentence planning, and speaking). We go on to show how insights from human language/vision integration can be used to build systems that automatically describe images. We propose a novel way of representing images as visual dependency graphs, where arcs between image regions are labeled with spatial relationships. The task of relating image regions to each other can then be viewed as a parsing task. We show how image parsing can be automated and how the output of an image parser can be used to generate image descriptions. The resulting system outperforms standard approaches that rely on object proximity or corpus information to generate descriptions.

Frank Keller

Frank Keller is professor of computational cognitive science in the School of Informatics at the University of Edinburgh. His background includes an undergraduate degree from Stuttgart University, a PhD from Edinburgh, and postdoctoral and visiting positions at Saarland University and MIT. His research focuses on how people solve complex tasks such as understanding language or processing visual information. His work combines experimental techniques with computational modeling to investigate reading, sentence comprehension, translation, and language generation, both in isolation and in the context of visual information such as photographs or diagrams. Prof. Keller serves on the management committee of the European Network on Vision and Language, is a member of the governing board of the European Association for Computational Linguistics, and holds an ERC starting grant in the area of language and vision.