Last updated: 15 May 2013
How do you overcome problems related to cross-linguality? My specific problem at the moment is caused by the poor coverage of everyday language in lexical resources. For instance, the Swedish single-word expression /egenremiss/ (14,900 hits, Google.se April 2013), or alternatively the multiword expression (MWE) /egen remiss/ (8,210 hits, Google.se April 2013), denotes a referral to a specialist doctor written by patients themselves. This expression is made up of two common Swedish words, /egen/ ‘own (adj)’ and /remiss/ ‘referral’. It is a recent expression (probably coined around 2010*) and is not yet recorded in any official dictionary, in Wiktionary, or in other multilingual online lexical resources. This compound happens to be very frequent in the query logs of a Swedish public health service website. When I tried to implement a cross-lingual search, it turned out that none of the existing multilingual lexical resources contained this expression, either as a single word or as a multiword expression.
Some well-known examples of lexical resources are Princeton WordNet (PWN) (Fellbaum 1998), Berkeley FrameNet (BFN) (Fillmore et al. 2003), and VerbNet (Dang et al. 2000); some multilingual initiatives (some of which include Swedish) are EuroWordNet (Vossen 1998), PAROLE (Ruimy et al. 1998), SIMPLE (Lenci et al. 2000), UWN (de Melo and Weikum 2009), UBY (Gurevych et al. 2012), BabelNet (Navigli and Ponzetto 2012), and of course Wiktionary; finally, some projects focusing on Swedish are SALDO (Borin and Forsberg 2009), Synlex (Kann and Rosell 2006; Borin and Forsberg 2010), SweFN++ (Borin et al. 2010), Swesaurus (Borin and Forsberg 2011), and SweCxn (Lyngfelt et al. 2012).
MWEs are occasionally recorded in several lexical resources (e.g. SweFN++, Borin et al. 2010). However, no existing lexical resource is specifically dedicated to the systematic storage and dynamic update of multilingual MWEs. The obvious consequence of this state of affairs is poor lexical coverage, which negatively affects the performance of many Natural Language Processing (NLP) tasks, such as natural language parsing and generation, as well as real-life applications that depend on language technology, such as machine translation or computational lexicography. Although many theoretical and computational methods have been proposed since the first ACL MWE workshop in 2003, MWEs remain a challenge in many respects. So much so that this year (2013) MWEs have gathered great momentum, and two important workshops centered on MWEs are being held at NAACL 2013 and MT Summit 2013. The latest advances in the field will be presented in the forthcoming special issue on “Multiword Expressions: from Theory to Practice and Use”.
The acronym “MWE” can be used as an umbrella term to cover a complex linguistic phenomenon that has been studied in many areas and labelled in different ways. To name but a few: in discourse analysis, first language acquisition, language pathology and applied linguistics, Wray (2008) uses the phrase “formulaic language”, defined as strings of words which appear to be processed without analysing the subcomponents of the string. According to Wray, formulaic language covers a considerable proportion of our everyday language, and her concern is with its psychological and social causes. The label “formulaic language” is also used by scholars interested in the educational implications of pre-fabricated expressions. For instance, Erman (2009) looks at the role of formulaic language in an L2 speaker’s oral and written production, from the very early stages to the last steps towards near-native and native proficiency.
Last but not least, the Anglo-Saxon corpus-linguistic and lexicographic tradition that goes back to Sinclair (1991) and Church and Hanks (1990) prefers the word “collocations”, a term that covers several types of frequently occurring patterns or, in other words, sequences of words that co-occur more often than would be expected by chance. Important corpus-based tools (e.g. Sketch Engine) and lexicographic works (such as Macmillan English Dictionaries) stem from this tradition. In particular, there are a number of specialized dictionaries and corpus-based tools that list frequent collocations in individual languages, such as (for English) Collins COBUILD English Collocations, the LTP Dictionary of Selected Collocations (1997), the Macmillan Collocations Dictionary (2010), the BBI Combinatory Dictionary of English, Online Oxford Collocation Dictionary, Free Online Collocations Dictionary, etc.; (for Spanish) Redes: Diccionario combinatorio del español contemporáneo (2004); and the Automatic Collocation Dictionaries for Bulgarian, Croatian, Czech, English, French, Maltese, Polish, Slovak and Spanish within Sketch Engine.
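To make “more often than would be expected by chance” concrete, here is a minimal sketch of a pointwise mutual information (PMI) score for adjacent word pairs, in the spirit of the association measure discussed by Church and Hanks (1990). The function name `bigram_pmi` and the toy token list are my own illustrative assumptions, not part of any of the tools mentioned above, and real collocation extraction would need a large corpus, frequency thresholds and smoothing.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return a dict mapping each adjacent word pair to its PMI score.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated by simple relative frequencies (no smoothing).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Invented toy data, for illustration only (not taken from real query logs).
toy_tokens = (
    "egen remiss till specialist och egen remiss till klinik och en remiss"
).split()
scores = bigram_pmi(toy_tokens)
print(scores[("egen", "remiss")])
```

A positive score means the pair occurs together more often than the individual word frequencies alone would predict. On a corpus this small the scores are not meaningful in themselves; the sketch only shows the shape of the computation.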
Completely independent of the Anglo-Saxon corpus-based multiword lexicographic tradition is Svenskt Språkbruk (2009), a unique dictionary designed within Språkrådet (Language Council of Sweden) that collects constructions and phrases in Swedish.
Existing MWE resources are invaluable. They have been painstakingly created over the last twenty years and represent ongoing research in which many aspects are still under investigation. From a theoretical viewpoint, they still suffer from constraints that affect the number and variety of recorded entries. From a practical point of view, these lexical resources are not yet cross-lingual, since there is no interlinking across MWEs (when they happen to be registered) in different languages. Additionally, they are not always open-source and freely reusable, so they cannot be easily incorporated into language technology-based applications.
Do you have any thoughts to share or suggestions to make about this issue? My impression is that we need to create a new type of lexical resource, namely a cross-lingual lexical knowledge base of single-word and multiword lexical forms that dynamically incorporates everyday language and emerging lexical forms…
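As a rough illustration of what such a knowledge base could look like at its simplest, the sketch below stores single-word and multiword variants under one shared concept identifier and links entries across languages through that identifier. Everything here is hypothetical: the class names `LexicalEntry` and `CrossLingualLexicon`, the concept identifier, and the English form “self-referral”, which I use only as an assumed illustrative equivalent and not as an attested translation.

```python
from dataclasses import dataclass


@dataclass
class LexicalEntry:
    concept_id: str   # hypothetical language-independent identifier
    language: str     # language code, e.g. "sv" or "en"
    forms: set        # single-word and multiword variants
    gloss: str = ""


class CrossLingualLexicon:
    """Toy in-memory store linking lexical forms across languages."""

    def __init__(self):
        self._by_form = {}     # (language, normalized form) -> concept_id
        self._by_concept = {}  # concept_id -> {language: LexicalEntry}

    @staticmethod
    def _normalize(form):
        # Lowercase and collapse runs of whitespace.
        return " ".join(form.lower().split())

    def add(self, entry):
        # New entries can be added at any time, e.g. from query logs.
        for form in entry.forms:
            key = (entry.language, self._normalize(form))
            self._by_form[key] = entry.concept_id
        self._by_concept.setdefault(entry.concept_id, {})[entry.language] = entry

    def translate(self, form, source_lang, target_lang):
        """Return the target-language forms linked to the given form."""
        key = (source_lang, self._normalize(form))
        concept_id = self._by_form.get(key)
        if concept_id is None:
            return set()
        target = self._by_concept[concept_id].get(target_lang)
        return set(target.forms) if target else set()


lexicon = CrossLingualLexicon()
lexicon.add(LexicalEntry(
    concept_id="c-0001",  # made-up identifier
    language="sv",
    forms={"egenremiss", "egen remiss"},
    gloss="referral to a specialist doctor written by patients themselves",
))
lexicon.add(LexicalEntry(
    concept_id="c-0001",
    language="en",
    forms={"self-referral"},  # assumed equivalent, for illustration only
))

print(lexicon.translate("Egen  remiss", "sv", "en"))
```

The point of the design is that the single-word and multiword spellings resolve to the same concept, so a cross-lingual search would not depend on which variant the user happens to type.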
* “egenremiss” is recorded in a glossary of medical terms (Ord- och begreppsförklaringar — innehållet är anpassat för urogenitala och närliggande sjukdomar), which was last updated on 2010-07-05 (http://home.swipnet.se/isop/ordforklaringar.htm).