The Statistics of Non-Linguistic Symbol Systems

Richard Sproat - Google, New York

For 5000 years humans have been using visible marks to encode spoken language. For a far longer period, they have been using visible marks to encode concepts, ideas or, in general, a variety of non-linguistic information. When faced with an ancient symbol system whose meaning is unknown, can one tell if it was linguistic (and therefore worth trying to decipher as a language), or some sort of non-linguistic system?

On the face of it, it seems reasonable to use as evidence statistical information on the behavior of symbols in the system. If the symbols distribute in a way that is similar to the distribution of elements (phonemes, morphemes, words, etc.) in language, then this could serve as evidence that the system is writing. In causal terms, the fact that it is writing causes the system to show the statistical properties it has.

Recent work that has used this line of argumentation suffers from a variety of problems. First, while such work invariably makes the claim that the statistical measures used are evidence for structure, often the measures actually tell us little or nothing about structure. Second, even if the measures do relate to structure, do they specifically imply /linguistic/ structure? A parse tree looks very similar to a tree that describes the structure of a mathematical formula, so structure per se hardly seems enough. This leads to a third problem: such work depends to some degree on a widespread misconception that non-linguistic systems are structureless. Finally, there is the question of whether sample sizes for such systems are ever large enough to make robust statistical claims.

In this talk I review the results of my own work on the statistics of non-linguistic symbol systems, and draw a mostly negative conclusion about the possibility of finding statistical measures that are useful in answering this question.

If you would like to meet with the speaker, please contact Vera Demberg.

Richard Sproat is a Research Scientist at Google, New York.

From January 2009 through October 2012, he was a professor at the Center for Spoken Language Understanding at the Oregon Health and Science University.

Prior to going to OHSU, he was a professor in the departments of Linguistics and Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He was also a full-time faculty member at the Beckman Institute, and still holds adjunct positions in Linguistics and ECE at UIUC.

Before joining the faculty at UIUC Richard worked in the Information Systems and Analysis Research Department headed by Ken Church at AT&T Labs — Research where he worked on Speech and Text Data Mining: extracting potentially useful information from large speech or text databases using a combination of speech/NLP technology and data mining techniques.

Before joining Ken’s department, Richard worked in the Human/Computer Interaction Research Department headed by Candy Kamm. His most recent project in that department was WordsEye, an automatic text-to-scene conversion system. The WordsEye technology is now being developed at Semantic Light, LLC. WordsEye is particularly good for creating surrealistic images that Richard can easily conceive of but that are well beyond his artistic ability to execute.

More information, and many more publications, can be found on Richard’s external website.