Resources

DeScript (Describing Script Structure)

DeScript is a corpus of event sequence descriptions (ESDs) for different scenarios, crowdsourced via Amazon Mechanical Turk. It covers 40 scenarios with approximately 100 ESDs each. The corpus also includes partial alignments of event descriptions that are semantically similar with respect to the given scenario.
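As a rough illustration of what ESDs and a partial alignment look like, here is a short Python sketch; the scenario, the event descriptions and the indices are made up for illustration and do not reflect the actual distribution format of the corpus.

# Illustrative sketch of DeScript-style data: two event sequence descriptions (ESDs)
# for the same scenario and a partial alignment of semantically similar event
# descriptions. Wording and indices are made up; this is not the release format.
scenario = "baking a cake"
esd_1 = ["preheat the oven", "mix the ingredients", "pour the batter into a pan", "bake the cake"]
esd_2 = ["turn on the oven", "combine flour, eggs and sugar", "put it in the oven"]

# Partial alignment: pairs (index in esd_1, index in esd_2) of event descriptions judged
# to describe the same scenario-specific event; not every description has a counterpart.
alignment = [(0, 0), (1, 1), (3, 2)]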

Link to the resource

Reference: Wanzare, L., Zarcone, A., Thater, S. & Pinkal, M. (2016). DeScript: A Crowdsourced Database for the Acquisition of High-quality Script Knowledge. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Contact person: Stefan Thater

 

Disco-SPICE (Spoken conversations from the SPICE-Ireland corpus annotated with discourse relations)

The resource contains all texts from the Broadcast interview and Telephone conversation genres from the SPICE-Ireland corpus, annotated with discourse relations according to the PDTB 3.0 and CCR frameworks.
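For orientation, the following Python sketch shows one way a single relation annotated in both frameworks could be represented; the field names and all label values are hypothetical examples, not the corpus' actual format or label inventory.

# Illustrative sketch of one discourse relation annotated in both frameworks.
# Field names and label values are hypothetical, not taken from the corpus.
relation = {
    "arg1": "I stayed at home",
    "arg2": "I was feeling ill",
    "connective": "because",                    # explicit relation; None if implicit
    "pdtb3_sense": "Contingency.Cause.Reason",  # a PDTB 3.0-style sense label (illustrative)
    "ccr": {                                    # CCR primitives (illustrative values)
        "basic_operation": "causal",
        "polarity": "positive",
        "source_of_coherence": "objective",
        "implication_order": "non-basic",
    },
}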

Link to the resource

Reference: Rehbein, I., Scholman, M. C. J. & Demberg, V. (2016). Annotating discourse relations in spoken language: A comparison of the PDTB and CCR frameworks. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Contact person: Vera Demberg

 

InScript (Narrative texts annotated with script information)

The InScript corpus contains a total of 1,000 narrative texts crowdsourced via Amazon Mechanical Turk. The texts cover 10 different scenarios describing everyday situations such as taking a bath or baking a cake. They are annotated with script information in the form of scenario-specific event and participant labels, as well as with coreference chains linking different mentions of the same entity within a document.
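The following Python sketch illustrates the annotation layers; the field names, labels and example sentence are hypothetical and do not reflect the actual distribution format of the corpus.

from dataclasses import dataclass, field

# Illustrative sketch of InScript-style annotation layers; field names and label
# values are hypothetical and do not reflect the corpus' release format.
@dataclass
class AnnotatedStory:
    scenario: str
    tokens: list
    event_labels: dict = field(default_factory=dict)        # token index -> scenario-specific event label
    participant_labels: dict = field(default_factory=dict)  # token index -> participant label
    coref_chains: list = field(default_factory=list)        # each chain: (start, end) spans of the same entity

example = AnnotatedStory(
    scenario="baking a cake",
    tokens=["I", "took", "the", "cake", "out", "and", "let", "it", "cool", "."],
    event_labels={1: "Ev_take_out", 6: "Ev_let_cool"},
    participant_labels={3: "Part_cake", 7: "Part_cake"},
    coref_chains=[[(2, 3), (7, 7)]],  # "the cake" and "it" refer to the same entity
)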

Link to the resource

Reference: Modi, A., Anikina, T., Ostermann, S. & Pinkal, M. (2016). InScript: Narrative texts annotated with script information. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Contact person: Stefan Thater

 

Modeling Semantic Expectations

This resource contains human discourse referent (DR) predictions on the InScript corpus, collected via Amazon Mechanical Turk. For details, please refer to the reference below.

Link to the resource

Reference: Modi, A., Titov, I., Demberg, V., Sayeed, A. & Pinkal, M. (2017). Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction. Transactions of the Association for Computational Linguistics (TACL).

 

Back-translation Annotated Implicit Discourse Relations

This resource contains implicit discourse relation instances that were labelled automatically via back-translation of parallel corpora: explicit discourse connectives that appear in translation are used to infer the sense of relations that are implicit in the original text. For details, please refer to the reference below.
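The following Python sketch illustrates the general idea under simplifying assumptions; the connective-to-sense mapping and the matching step are hypothetical illustrations, not the authors' actual pipeline.

# Minimal sketch: an implicit relation in the original English text may become explicit
# in (back-)translation; the inserted connective can then be mapped to a relation sense.
# The mapping below is a simplified, hypothetical example.
CONNECTIVE_SENSES = {
    "because": "Contingency.Cause",
    "but": "Comparison.Contrast",
    "then": "Temporal.Asynchronous",
}

def label_from_back_translation(back_translated_text):
    """Return an inferred relation sense if the back-translated text contains a known
    explicit connective, otherwise None."""
    tokens = back_translated_text.lower().replace(",", " ").split()
    for connective, sense in CONNECTIVE_SENSES.items():
        if connective in tokens:
            return sense
    return None

# Original (implicit): "The meeting was cancelled. Half the team was ill."
# The back-translation of the parallel text makes the relation explicit:
print(label_from_back_translation("The meeting was cancelled because half the team was ill."))
# -> Contingency.Cause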

Link to the resource

Reference: Shi, W., Yung, F., Rubino, R. & Demberg, V. (2017). Using Explicit Discourse Connectives in Translation for Implicit Discourse Relation Classification. Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017), Taipei, Taiwan.

 

MCScript

MCScript is a dataset for the task of machine comprehension focusing on commonsense knowledge. Questions were collected based on script scenarios rather than individual texts, which resulted in question–answer pairs that explicitly involve commonsense knowledge. The dataset comprises 13,939 questions on 2,119 narrative texts, each text annotated with one of 110 everyday scenarios. Each question carries a crowdsourced type annotation indicating whether it can be answered from the text alone or whether commonsense knowledge is needed to find the answer.
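The following Python sketch shows what such an instance could look like; the field names and the example content are hypothetical and do not reflect the actual release format.

# Illustrative sketch of an MCScript-style instance; field names and content are
# hypothetical examples, not the corpus' release format.
instance = {
    "scenario": "planting a tree",
    "text": "I dug a hole in the garden, placed the sapling inside and watered it.",
    "question": "What did the person dig the hole with?",
    "answers": ["a shovel", "a spoon"],
    "correct_answer": 0,
    # Crowdsourced question type: answerable from the text alone vs. requiring
    # commonsense knowledge (here, the tool is never mentioned in the text).
    "question_type": "commonsense",
}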
 
Link to the resource
 
Reference: Ostermann, S., Modi, A., Roth, M., Thater, S. & Pinkal, M. (2018). MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge. Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
 
Contact person: Michael Roth

 

MCScript 2.0

MCScript2.0 is a machine comprehension corpus for the end-to-end evaluation of script knowledge. It contains approximately 20,000 questions on approximately 3,500 texts, crowdsourced using a new collection process that results in challenging questions. Half of the questions cannot be answered from the reading texts alone but require the use of commonsense and, in particular, script knowledge. The task is not challenging for humans, but existing machine comprehension models fail to perform well on the data, even when they make use of a commonsense knowledge base. Note: the download contains only the training and development data. The test data are not public as of May 2019, since the dataset is used for a shared task at the COIN workshop (https://coinnlp.github.io/task1.html).

Link to the resource

Reference: Ostermann, S., Roth, M. & Pinkal, M. (2019). MCScript2.0: A Machine Comprehension Corpus Focused on Script Events and Participants. Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), Minneapolis, USA.

Contact person: Michael Roth

 

 

Detecting Everyday Scenarios in Narrative Texts

This resource contains sentence-level annotations in which sentences (segments) are labeled according to the scripts they instantiate. Each text was independently annotated by two annotators. For each text, the annotators identified segments referring to a scenario from a scenario list and assigned scenario labels; if a segment referred to more than one script, they were allowed to assign multiple labels. A scenario label is either one of 200 scenarios or "None" for sentences that do not refer to any of the scenarios. The resource contains 504 documents, consisting of a total of 10,754 sentences. On average, each document is 35.74 sentences long.
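The following Python sketch illustrates the resulting annotation; the scenario names and the data layout are hypothetical examples, not the actual release format.

# Illustrative sketch of sentence-level scenario labels; scenario names and layout
# are hypothetical examples, not the corpus' release format.
document = [
    ("We took the early train to the coast.", ["taking_a_train"]),
    ("I had packed sandwiches and booked a hotel the night before.",
     ["preparing_food", "booking_a_hotel"]),                    # a segment may instantiate several scripts
    ("The weather forecast had promised sunshine.", ["None"]),  # refers to none of the 200 scenarios
]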

Link to the resource

Reference: Wanzare, L. D. A., Roth, M. & Pinkal, M. (2019). Detecting Everyday Scenarios in Narrative Texts. Proceedings of the Second Workshop on Storytelling (ACL 2019), Florence, Italy.

Contact person: Lilian Wanzare

 

Royal Society Corpus Version 4.0

The Royal Society Corpus (RSC) is based on the first two centuries of the Philosophical Transactions of the Royal Society of London, from its beginning in 1665 to 1869. It includes all publications of the journal that are written mainly in English and contain running text. The Philosophical Transactions was the first periodical of scientific writing in England. Version 4.0 of the RSC consists of approximately 32 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decades and 50-year periods is also available, allowing for diachronic analyses at different levels of granularity. We also annotate the two most important topics of each text according to a topic model consisting of 24 topics; the full topic model is available for download.

The corpus is tokenized and linguistically annotated for lemma and part of speech using TreeTagger (Schmid 1994, Schmid 1995). For spelling normalization we use a trained model of VARD (Baron and Rayson 2008). As a special feature, we encode with each unit (word token) its average surprisal, i.e. the average amount of information it carries in bits, with words as units and trigrams as contexts (cf. Genzel and Charniak 2002). Release 4.0 of the corpus includes improved OCR correction and the removal of non-text tokens such as formulæ and tables.
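The average surprisal of a word corresponds to -log2 p(w | two preceding words), averaged over the word's occurrences. The following Python sketch computes this quantity under the simplifying assumption of an unsmoothed maximum-likelihood trigram model; the estimates shipped with the corpus may use different preprocessing and smoothing.

import math
from collections import Counter, defaultdict

def average_surprisal(tokens):
    """Average surprisal per word type, in bits, under an unsmoothed maximum-likelihood
    trigram model: s(w_i) = -log2 p(w_i | w_{i-2}, w_{i-1}). Simplified sketch only."""
    padded = ["<s>", "<s>"] + list(tokens)
    trigram_counts = Counter(zip(padded, padded[1:], padded[2:]))
    bigram_counts = Counter(zip(padded, padded[1:]))

    total_bits = defaultdict(float)
    occurrences = defaultdict(int)
    for (w1, w2, w3), count in trigram_counts.items():
        surprisal = -math.log2(count / bigram_counts[(w1, w2)])  # bits for this context
        total_bits[w3] += surprisal * count
        occurrences[w3] += count
    return {word: total_bits[word] / occurrences[word] for word in total_bits}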

Link to the resource

Reference: Kermes, H., Degaetano, S., Khamis, A., Knappen, J. & Teich, E. (2016). The Royal Society Corpus: From Uncharted Data to Corpus. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. http://www.lrec-conf.org/proceedings/lrec2016/summaries/792.html

Contact person: Elke Teich

 

 

EuroParl-UdS

The EuroParl-UdS corpus is a parallel corpus of European Parliament debates produced between 1999 and 2017, filtered by the speakers' native language. It is currently available for English, German and Spanish, and the data is distributed in plain text format. More specifically, it consists of parallel (sentence-aligned) corpora for English into German and English into Spanish, where the source side contains only texts by native English speakers, and of comparable monolingual corpora for English, German and Spanish, containing only texts by native speakers of each language.
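Since the data is sentence-aligned plain text, parallel corpora of this kind can typically be read by pairing files line by line, as in the Python sketch below; the file names used there are hypothetical placeholders, not the actual EuroParl-UdS layout.

# Reading a sentence-aligned plain-text parallel corpus by pairing files line by line.
# The file names below are hypothetical placeholders, not the actual EuroParl-UdS layout.
def read_parallel(source_path, target_path):
    with open(source_path, encoding="utf-8") as src, open(target_path, encoding="utf-8") as tgt:
        for source_line, target_line in zip(src, tgt):
            yield source_line.strip(), target_line.strip()

# Hypothetical usage:
# for en_sentence, de_sentence in read_parallel("en-de.en", "en-de.de"):
#     ...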

Link to the resource

Reference: Karakanta, A., Vela, M. & Teich, E. (2018). EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates. Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Contact person: Elke Teich