Professor Pierrehumbert presents at SIGMORPHON 2018 in Brussels

Professor Janet Pierrehumbert presented joint work with Ramon Granell on 31 October 2018 in Brussels at SIGMORPHON 2018, the 15th SIGMORPHON Workshop.

SIGMORPHON (Special Interest Group on Computational Morphology and Phonology) is a special interest group of the Association for Computational Linguistics. The 15th SIGMORPHON Workshop took place in Brussels, Belgium, on 31 October 2018, in association with the EMNLP 2018 conference.

Professor Pierrehumbert presented a poster titled ‘On hapax legomena and morphological productivity’, based on work carried out jointly with Ramon Granell.

To express new concepts, people often make new words by recombining parts of words they already know, or by combining existing words (e.g. "air" + "taxi" to make the new word "air-taxi"). Other people can understand these words, but they pose great difficulties for speech and language technology. The researchers carried out an analysis of Wikipedia aimed at predicting which of these word-formation patterns will be used most actively in the future.

A hapax legomenon (plural: hapax legomena) is a word that occurs exactly once in a corpus, such as a written work or the body of work of a particular author. Morphological productivity describes how readily morphemes (the smallest units of a language, which cannot be further divided, e.g. ‘stay’) are used to create new words, for example ‘staycation’. The number of hapax legomena that contain a given morpheme is the best-known indicator of how productively that morpheme is used to form new words.
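As a rough illustration of the idea, the short Python sketch below counts the hapax legomena in a tiny made-up text and then picks out those ending in a given string. It is not the researchers' method: the toy corpus, the function names `hapax_legomena` and `hapaxes_with_suffix`, and the use of simple string matching in place of real morphological analysis are all assumptions made for this example.

```python
import re
from collections import Counter


def hapax_legomena(text):
    """Return the set of words that occur exactly once in the text."""
    # Lowercase the text and keep alphabetic tokens (hyphenated words stay whole).
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n == 1}


def hapaxes_with_suffix(hapaxes, suffix):
    """Return, in sorted order, the hapaxes that end with the given string.

    Assumption: plain string matching stands in for morphological analysis.
    """
    return sorted(w for w in hapaxes if w.endswith(suffix) and len(w) > len(suffix))


# Hypothetical toy corpus, invented for this illustration.
toy_corpus = (
    "We took a staycation this year. The vacation last year was long, "
    "and the vacation before that was longer. A workation is next."
)

hapaxes = hapax_legomena(toy_corpus)
ation_hapaxes = hapaxes_with_suffix(hapaxes, "ation")
print(ation_hapaxes)  # ['staycation', 'workation']
```

Here ‘vacation’ occurs twice, so it is not a hapax, while ‘staycation’ and ‘workation’ each occur once. Under the hapax measure, a pattern that keeps turning up in such one-off words is taken to be one that speakers are actively using to coin new words.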

Professor Pierrehumbert and Ramon Granell’s work evaluates the hapax measure using a much larger corpus than those used previously (1.24 billion words from a 2013 download of Wikipedia) to ask: ‘Are hapax counts predictive for larger corpora and test sets? Are the simplifying assumptions valid? What do the results suggest about human language processing?’ The work also explores more morphemes than previous studies, including all 133 prefixes, suffixes, and compounding elements that meet specific inclusion criteria.

Surprisingly, the most productive patterns are not those that are most common in words that everyone knows. Instead, new words are most likely to reuse elements of rare and specialized vocabulary that distinguishes between people with different expertise. This result shows the need for language processing algorithms that can adapt their models of word-formation to different topics and social groups.