Modeling Linguistic Relationships and Language Evolution

Roman Yangarber - Department of Computer Science - University of Helsinki


Linguists have studied the relationships among languages for centuries, in particular the question of how languages evolved over time and from which ancestral languages they descend. In the Etymon Project, we approach these questions by quantitative, statistical means.

We will discuss several approaches to modeling linguistic evolutionary processes, i.e., the processes by which language families evolve through time, which in some ways resemble the evolution of biological populations. We begin with datasets of cognates: words from different languages in a language family that are believed to be genetically related, i.e., to derive from a common (typically unobserved) ancestor via unobserved laws of sound change. The only assumption we make is that the sound laws are regular. The methods are based on the information-theoretic Minimum Description Length (MDL) principle.
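As a rough illustration of why MDL rewards regular sound laws, the following minimal sketch computes an adaptive (prequential) code length for a list of aligned sound pairs. It is not the Etymon Project's actual model: the function name, the add-one smoothing, the alphabet size, and the toy sound pairs are all assumptions made up for this example.

```python
import math
from collections import Counter


def prequential_code_length(pairs, alphabet_size):
    """Bits needed to encode `pairs` one at a time with an adaptive add-one model.

    Each event is a (source sound, target sound) pair; the number of possible
    events is assumed to be alphabet_size ** 2.
    """
    counts = Counter()
    n_events = alphabet_size * alphabet_size
    total_bits = 0.0
    for seen, pair in enumerate(pairs):
        # Probability of the next pair given everything encoded so far.
        p = (counts[pair] + 1) / (seen + n_events)
        total_bits += -math.log2(p)
        counts[pair] += 1
    return total_bits


# Hypothetical toy data: every "k" corresponds to "h", every "a" to "a".
regular = [("k", "h")] * 6 + [("a", "a")] * 6

# Hypothetical toy data: the same source sounds with no consistent correspondence.
irregular = [
    ("k", "h"), ("k", "g"), ("k", "k"), ("k", "t"), ("k", "s"), ("k", "p"),
    ("a", "a"), ("a", "o"), ("a", "e"), ("a", "i"), ("a", "u"), ("a", "y"),
]

regular_bits = prequential_code_length(regular, alphabet_size=10)
irregular_bits = prequential_code_length(irregular, alphabet_size=10)
print(f"regular: {regular_bits:.1f} bits, irregular: {irregular_bits:.1f} bits")
```

In this toy setting the regular correspondences cost fewer bits, because repeated pairs become cheaper to encode once they have been seen. That is the intuition behind preferring the alignment and model that compress the cognate data best.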

Our goals include:

– to find globally-optimal models of the data at the level of individual sounds,

– to discover the laws of sound change inherent in the observed data,

– to reconstruct the phylogenetic structure of the language family.

We discuss how to compare the quality of the proposed models and of the alignments in the data, ways of measuring distance between languages, and how to compare the quality of different datasets for the same language family. We also consider ways of evaluating the resulting phylogenies against available “gold-standard” trees.

Our studies are based on data from the Uralic, Turkic and Indo-European language families.

If you would like to meet with the speaker, please contact Andrea Fischer.



Roman Yangarber

Roman Yangarber received his Doctorate in Computer Science, with a concentration in Natural Language Processing (NLP), from the Courant Institute of Mathematical Sciences, New York University (NYU) in 2001. Prior to moving to Finland in 2004, he held the post of Assistant Research Professor at NYU, where he specialized in computational linguistics, focusing on machine learning for the acquisition of semantic knowledge from text. Since coming to the Department of Computer Science at the University of Helsinki, he has held the post of Acting Professor and has led the NLP Research Group, where he advises MSc and PhD students in computational linguistics in several internationally and nationally funded research projects, in collaboration with partners from academia, industry, government and NGOs.

The main research areas of the group are:

– Web-scale surveillance of news media,

– modeling of language relationships and language evolution,

– automatic inference of morphological systems for morphologically rich languages,

– computational tools for language learning and for supporting endangered languages.