Open Knowledge Representation for Texts

Ido Dagan - Natural Language Processing Lab, Department of Computer Science, Bar-Ilan University

How can we capture the information expressed in large amounts of text? And how can we allow people, as well as computer applications, to easily explore it? When comparing textual knowledge to formal knowledge representation (KR) paradigms, two prominent differences arise. First, typical KR paradigms rely on pre-specified vocabularies, which are limited in their scope, while natural language is inherently open. Second, in a formal knowledge base each fact is encoded in a single canonical manner, while across multiple texts a fact may be repeated, accompanied by redundant, complementary and even contradictory information.

In this talk, I will outline a new research direction, termed Open Knowledge Representation (OKR), which aims to represent textual information in a consolidated manner, based on the available natural language vocabulary and structure. I will describe our first specification for the OKR structure, motivated by a use case of representing multiple tweets describing an event, for which we have created a medium-scale annotated dataset. Our structure merges co-referring individual proposition extractions, created in an Open-IE style, into a representation of consolidated entities and propositions, inspired by formal knowledge graphs. Different language expressions, denoting entities, arguments and propositions, are further organized into entailment graphs, which allow tracing information redundancy and containment. I will also present some analysis of our dataset and baseline results, illustrate the potential application of OKR for text exploration and point to possible directions in which the OKR paradigm might evolve.
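
To make the consolidation step more concrete, here is a minimal Python sketch of merging co-referring Open-IE-style extractions into consolidated entities and propositions. It is an illustration under stated assumptions, not the OKR specification: the `Extraction` and `Consolidated` classes, the `consolidate` function, the cluster ids and the example tweets are all hypothetical, coreference decisions are assumed to be given as input, and the entailment graphs over expressions are left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Extraction:
    """One Open-IE-style proposition from a single tweet (hypothetical format)."""
    predicate: str        # predicate expression as it appears in the text
    args: tuple           # argument mention strings, in order
    prop_cluster: str     # assumed coreference id of the proposition
    arg_clusters: tuple   # assumed coreference id of each argument


@dataclass
class Consolidated:
    """A consolidated entity or proposition and its surface expressions."""
    expressions: set = field(default_factory=set)
    arguments: set = field(default_factory=set)  # entity ids; used by propositions only


def consolidate(extractions):
    """Group mentions and predicates by their (assumed given) coreference ids."""
    entities = defaultdict(Consolidated)
    propositions = defaultdict(Consolidated)
    for ex in extractions:
        prop = propositions[ex.prop_cluster]
        prop.expressions.add(ex.predicate)
        for mention, cluster in zip(ex.args, ex.arg_clusters):
            entities[cluster].expressions.add(mention)
            prop.arguments.add(cluster)
    return dict(entities), dict(propositions)


# Two invented tweets reporting the same event in different words.
tweets = [
    Extraction("shot", ("a gunman", "three people"), "P1", ("E1", "E2")),
    Extraction("opened fire on", ("the shooter", "3 victims"), "P1", ("E1", "E2")),
]
entities, propositions = consolidate(tweets)
```

In this sketch the two extractions collapse into one proposition, `P1`, that keeps both predicate expressions, while each entity keeps every mention used for it. A fuller representation would additionally link those alternative expressions by entailment.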

Ido Dagan

Ido Dagan is a Professor at the Department of Computer Science at Bar-Ilan University, Israel, and a Fellow of the Association for Computational Linguistics (ACL). His interests are in applied semantic processing, focusing on textual inference and natural-language-based knowledge representation and acquisition. Dagan and colleagues established the textual entailment recognition paradigm. He was the President of the ACL in 2010 and served on its Executive Committee during 2008-2011. In that capacity, he led the establishment of the journal Transactions of the Association for Computational Linguistics. Dagan received his B.A. summa cum laude and his Ph.D. (1992) in Computer Science from the Technion. He was a research fellow at the IBM Haifa Scientific Center (1991) and a Member of Technical Staff at AT&T Bell Laboratories (1992-1994). During 1998-2003 he was co-founder and CTO of FocusEngine and VP of Technology of LingoMotors.