Towards a Cross-Lingual Lexical Knowledge Base of Lexical Forms

Last updated: 15 May 2013

How do you overcome problems related to cross-linguality? My specific problem at the moment is caused by the poor coverage of everyday language in lexical resources. For instance, the Swedish single-word expression /egenremiss/ (14,900 hits, April 2013), alternatively written as a multiword expression (MWE), /egen remiss/ (8,210 hits, April 2013), denotes a referral to a specialist doctor written by patients themselves. This expression is made up of two common Swedish words, /egen/ `own (adj)’ and /remiss/ `referral’. It is a recent expression (probably coined around 2010*) and is not yet recorded in any official dictionary, in Wiktionary, or in other multilingual online lexical resources. This compound happens to be very frequent in the query logs of a Swedish public health service website. When I tried to implement a cross-lingual search, it turned out that none of the existing multilingual lexical resources contained this expression, either as a single word or as a multiword.

Some well-known examples of lexical resources are Princeton WordNet (PWN) (Fellbaum 1998), Berkeley FrameNet (BFN) (Fillmore et al. 2003), and VerbNet (Dang et al. 2000); some multilingual initiatives (some of which include Swedish) are EuroWordNet (Vossen 1998), PAROLE (Ruimy et al. 1998), SIMPLE (Lenci et al. 2000), UWN (de Melo and Weikum 2009), UBY (Gurevych et al. 2012), BabelNet (Navigli and Ponzetto 2012), and of course Wiktionary; finally, some projects focusing on Swedish are SALDO (Borin and Forsberg 2009), Synlex (Kann and Rosell 2006; Borin and Forsberg 2010), SweFN++ (Borin et al. 2010), Swesaurus (Borin and Forsberg 2011), and SweCxn (Lyngfelt et al. 2012).

MWEs are occasionally recorded in several lexical resources (e.g. SweFN++, Borin et al. 2010). However, no existing lexical resource is specifically dedicated to the systematic storage and dynamic update of multilingual MWEs. The obvious consequence of this state of affairs is poor lexical coverage, which negatively affects the performance of many Natural Language Processing (NLP) tasks, such as natural language parsing and generation, as well as real-life applications that depend on language technology, such as machine translation or computational lexicography. Although many theoretical and computational methods have been proposed since the first ACL MWE workshop in 2003, MWEs remain a challenge in many respects. So much so that this year (2013) MWEs have gathered great momentum, and two important workshops centered on MWEs are being held at NAACL 2013 and MT Summit 2013. The latest advances in the field will be presented in the forthcoming special issue on “Multiword Expressions: from Theory to Practice and Use”.

The acronym “MWE” can be used as an umbrella term to cover a complex linguistic phenomenon that has been studied in many areas and labelled in different ways. To name but a few: in discourse analysis, first language acquisition, language pathology and applied linguistics, Wray (2008) uses the phrase “formulaic language”, defined as strings of words which appear to be processed without analysing the subcomponents of the string. According to Wray, formulaic language covers a considerable proportion of our everyday language, and her concern is with its psychological and social causes. The label “formulaic language” is also used by scholars interested in the educational implications of pre-fabricated expressions. For instance, Erman (2009) looks at the role of formulaic language in an L2 speaker’s oral and written production from the very early to the last steps towards near-native and native proficiency.

Last but not least, the Anglo-Saxon corpus-linguistic and lexicographic tradition that refers to Sinclair (1991) and Church and Hanks (1990) prefers the word “collocations”, an expression that covers several types of frequently occurring patterns or, in other words, sequences of words that co-occur more often than would be expected by chance. Important corpus-based tools (e.g. Sketch Engine) and lexicographic works (such as the Macmillan English Dictionaries) stem from this tradition. In particular, there are a number of specialized dictionaries and corpus-based tools that list frequent collocations in individual languages, such as (for English) Collins COBUILD English Collocations, the LTP Dictionary of Selected Collocations (1997), the Macmillan Collocations Dictionary (2010), the BBI Combinatory Dictionary of English, the Online Oxford Collocation Dictionary, the Free Online Collocations Dictionary, etc.; (for Spanish) Redes: Diccionario combinatorio del español contemporáneo (2004); and the Automatic Collocation Dictionaries for Bulgarian, Croatian, Czech, English, French, Maltese, Polish, Slovak and Spanish within Sketch Engine.

Completely independent of the Anglo-Saxon corpus-based multiword lexicographic tradition is Svenskt Språkbruk (2009), a unique dictionary designed within Språkrådet (Language Council of Sweden) that collects constructions and phrases in Swedish.

Existing MWE resources are invaluable. They have been painstakingly created over the last twenty years and represent ongoing research in which many aspects are still under investigation. From a theoretical viewpoint, they still suffer from constraints that affect the number and variety of recorded entries. From a practical point of view, these lexical resources are not yet cross-lingual, since there is no interlinking across MWEs (when they happen to be registered) in different languages. Additionally, they are not always open-source and freely reusable, so they cannot be easily incorporated into language technology-based applications.

Do you have any thought to share or any suggestion to make about this issue? My impression is that we need to create a new type of lexical resource, namely a cross-lingual lexical knowledge base of single-word and multiword lexical forms that includes everyday language and emerging lexical forms dynamically…

* “egenremiss” is recorded in a glossary of medical terms (Ord- och begreppsförklaringar – innehållet är anpassat för urogenitala och närliggande sjukdomar), which was last updated on 2010-07-05.

15 comments for “Towards a Cross-Lingual Lexical Knowledge Base of Lexical Forms”

  1. Marina Santini
    15 April, 2013 at 13:33

    Comments from the SENTIMENT ANALYSIS LinkedIn Group

    Feng Tian • Hi, Marina
    “a cross-lingual lexical knowledge base of single-word and multiword lexical forms”

    What a great idea! What form would it take? How about making it more fun?

    We could build a game-based website in which users from different countries translate everyday language on given topics into their mother tongue.

    Maybe this is useful.


    Stephen Higgins • Talk to Teragram. They were the folks that originally did the Google contextual spelling. They support most major languages, and not just individual words. Think idiomatic phrases. I hope this helps.

    Brent Auble • Feng, I believe that’s exactly what Duolingo does, although they’re doing it in the context of teaching people languages rather than building an openly accessible lexical database. I wonder if they could be talked into making some of their data available?

    Nigel Legg • Interesting idea. Though there is a whole host of complexities: a bigram in English could be a trigram or unigram in another language, which makes the design of the data structure used to develop this tricky. Add to that the fact that language is changing at different rates in different places.
    Maybe we need a similar thing for English (not sure WordNet covers this). In the last week, various people have used the following casual greetings when addressing me: chuck, love, duck, mate, chap, fellow, kid, buddy, and bro. All of these people were British.

    Feng Tian • To Brent: I visited the website you mentioned above. It’s different from what I had in mind.

    The game-based website I was thinking of might be characterized by:
    1) collaborative tagging/labeling, organized by topic, in which users imagine themselves in different situations, such as greeting their girlfriend or their father-in-law. (Note that the case mentioned by Nigel Legg, the casual greetings used when addressing him, is also covered.)
    2) a well-designed policy for gaining points after users complete a task;
    3) users being able to view each other’s translations for fun;
    4) a backbone framework for statistically calculating these pairs, or triples, or …

    Something like the above is what I mean.


  2. Marina Santini
    15 April, 2013 at 13:39

    Suggestions from the Computational Linguistics LinkedIn Group

    Maria Teresa Pazienza • If it is of use, please consider taking a look at OntoLing: a tool for the Linguistic Enrichment of ontologies (available as an extension for different ontology development platforms), developed by our research group ART at Tor Vergata university.
    Please visit the site for more details. It relates to a research activity carried out a few years ago.

    Leo Konst • In 1985 I started my own company, Linguistic Systems BV, as a spin-off from the University of Nijmegen.

    The ultimate goal was building systems which were able to understand language and use these systems for several practical purposes such as automatic translation, question answering, data-mining, speech-recognition etc.

    To build linguistic systems we needed tools, but at that time linguistic resources in digital form, like dictionaries and thesauri were not ubiquitous.

    So we decided to build our own tools and since I started with a credit from the Economic Department of the Dutch Government, which was doubled by the Rabo Bank, I could hire a lot of translators, native speakers and linguists to do the job.

    With user interaction in mind, our mission was to build an intuitive multilingual translation dictionary with cross-linked translation between six European languages: English, French, German, Spanish, Italian and Dutch.

    The dictionary should be a dictionary of concepts instead of words, because only on a conceptual level cross-linked translation is possible.

    A concept is defined by a set of words which are synonyms or otherwise semantically related.

    So the set defines another concept than the set and the set defines again another concept.

    Each concept was given a header which could accommodate linguistic information legitimate for the whole concept.

    Each concept was also given a number and in order to add some statistical information, the concepts were ordered in frequency of use on the basis of various Frequency Lists available at that time.

    When a word is entered, the system should give all concepts the word falls in, in sequence of frequency.
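
The concept-based design described above (numbered concepts, each holding a set of related words per language, returned in order of frequency) could be sketched roughly as follows. This is only an illustration in Python and not Euroglot's actual implementation: the class names, the two example concepts and their frequency ranks are assumptions invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Concept:
    number: int            # concept identifier
    frequency_rank: int    # 1 = most frequent (ranks here are invented)
    words: dict = field(default_factory=dict)  # language code -> set of words


class ConceptDictionary:
    def __init__(self):
        self._concepts = {}
        self._index = defaultdict(set)  # (language, word) -> concept numbers

    def add(self, concept):
        self._concepts[concept.number] = concept
        for lang, words in concept.words.items():
            for word in words:
                self._index[(lang, word.lower())].add(concept.number)

    def lookup(self, lang, word):
        """Return every concept the word falls in, most frequent first."""
        numbers = self._index.get((lang, word.lower()), set())
        found = (self._concepts[n] for n in numbers)
        return sorted(found, key=lambda c: c.frequency_rank)

    def translate(self, src_lang, word, tgt_lang):
        """One list of target-language words per concept, in frequency order."""
        return [sorted(c.words.get(tgt_lang, ()))
                for c in self.lookup(src_lang, word)]


# Hypothetical entries: the Dutch word falls in two different concepts.
dictionary = ConceptDictionary()
dictionary.add(Concept(1, 1, {"nl": {"ontsteking"}, "en": {"inflammation"}}))
dictionary.add(Concept(2, 2, {"nl": {"ontsteking"}, "en": {"ignition"}}))

for words in dictionary.translate("nl", "ontsteking", "en"):
    print(words)
```

Because translation goes through the concept rather than directly from word to word, an ambiguous word yields one group of translations per concept instead of a single flat list.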

    In the ’90s we built a morphological generator and recogniser in six languages and the product Euroglot Professional was born.

    You can look at

    Marina Santini • Thanks for sharing your experience, Leo. I am sure you were a pioneer in the field. Thank you also for stressing the importance of lexical resources for the creation of good NLP tools and language technology-based application.
    As I mentioned in another discussion group (see the whole list of contributions to this discussion in the blog post’s comments that I keep synchronized), there are two big challenges that we have not addressed yet:
    1) how to go beyond the crystallization of language in “concepts”; there is quite a wide range of lexical expressions that do not fit well into single “concepts”. I am thinking of lexical bundles, trigger phrases, light verbs (which are always multiword expressions), etc…
    2) how to dynamically identify new expressions and then update a multilingual resource with new chunks of language that emerge and become stabilized very quickly?

    As I see it, while the two points above can only be addressed by a newly conceived lexical resource, what we could do to broaden the coverage of existing resources (that are valuable in many ways) is to include the big mass of multiword expressions (from compounds to idioms) that characterize any language.

    Leo Konst • Marina, thank you for your comment. However, we have a lot of multiword expressions put into concepts together with synonym expressions. Idioms, phrasal verbs and even proverbs have been put into concepts and can be triggered by single keywords.
    We also extended our multilingual dictionary with domain specific dictionaries: medical, legal, trade, chemistry, technical etc.
    Try looking up the Dutch word “ontsteking” and you’ll see what I mean. We have six languages with 30 combinations; adding a language will give 42 combinations, and so on. It’s not easy to add a language, but it can be done. I think that multilingual lexical resources can only be built on a conceptual level, just because there is so much ambiguity.

    Marina Santini • I see what you mean, Leo.
    I tried “ontsteking” –> English, and also “inflammation” –> Italian.
    You said “We have six languages with 30 combinations, adding a language will give 42 combi’s and so on.” I assume you map each language into all the others (a many-to-many combination)… I would be more inclined to reduce this complexity by using a pivot language or a hub language… Anyway, thank you for letting us know about euroglotonline. I was personally not aware of it. It is indeed a rich resource. But… 🙂 Hope we will have the opportunity of a live discussion on the many aspects of cross-linguality sooner or later…
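
The figures quoted in this exchange (30 combinations for six languages, 42 for seven) are the number of ordered source/target pairs among n languages, n × (n − 1). A small sketch of that arithmetic, next to the count for a pivot design, follows. The function names are made up for illustration, and the pivot count assumes one mapping into and one out of the hub for each non-hub language.

```python
def directed_pairs(n):
    """Ordered (source, target) pairs when every language maps to every other."""
    return n * (n - 1)


def pivot_mappings(n):
    """Directed mappings when each non-pivot language maps to and from one pivot."""
    return 2 * (n - 1)


for n in (6, 7, 10):
    print(n, directed_pairs(n), pivot_mappings(n))
```

The all-pairs count grows quadratically while the pivot count grows linearly. The price of the pivot design is that every translation passes through the hub language, so distinctions the hub does not make may be lost on the way.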


  3. Marina Santini
    15 April, 2013 at 13:43

    Suggestions from the CORPUS LINGUISTICS LinkedIn Group

    Horst Bogatz • Hi Marina,
    I created a cross-lingual collocational dictionary, i.e. The Advanced Reader’s Collocation Searcher. It is an electronic one with all kinds of searches.
    If you want to find a great number of language resources, spoken or written ones, you may want to go to


    Horst Bogatz


    Kaisa Azriouli • I came across the same issue in my own investigations into language acquisition and the process of accommodation at the crossroads, node, or spectrum of attention amongst various languages, if you will.

    The problem of multi-word expressions like the one presented in the Swedish glossary and the case of lexical syntax targeted for precision in multi-lingual context-bound generalities lies and falls into several branches of both linguistic and psychological research. It is true that the formalism and the operatives therein could be applicable for the fill-in-slot kind of practises where collocation and locution are in the grammar.

    Your suggestion to make separate routes for word to word and multi-word placements is, if possible, in effect advancing possibilities in machine translation for more subtle and possibly also cohesive in the simplest manner.

    It is noteworthy that compounds form the most part of the work in the sense you mentioned in your example.

    Being totally unaware of the computational-linguistic and programming prerequisites needed to follow your logic, I can only say that, for the sake of language diversity, your idea is well founded and deserves linguistic support.

    Horst Bogatz • Hello Kaisa,

    as far as I know, the slot-and-filler theory for explaining language acquisition has not been supported by contemporary neurological research.

    Modern theories involve hierarchical learning systems of linear sequences and lists. Humans have to learn one conceptual hierarchy at a time. Collocations and phrases constitute only one level of the hierarchies. The design of any electronic dictionary, monolingual or cross-language, must mirror the hierarchical structures of interacting subsystems of pattern recognition.
    Google started with a word-for-word translation from one language to another, but now it takes into account the inherent hierarchical nature of language. And the results of its translations have improved.


    Horst Bogatz

    Kaisa Azriouli • Hello Horst,

    My purpose in bringing up the existence of fill-in-slot practice wasn’t to refer to language acquisition as such, but to its connotative grounds for use. I realise that the collocational level forms only the level of mnemonic functions in connotational ways, but it can also be of significance in the process of translation.

    As said I am not familiar with the programming procedures of translation, but if there will be some ‘interaction’ between the various hierarchical levels of languages, I can understand that it is a great asset forward in machine translation technology.

    Marina Santini • Kaisa, personally and instinctively I am inclined to see language not in terms of “words”, but in terms of “chunks of language in use”. We do not learn single words in isolation, but sequences or combinations of words in specific communicative contexts. These chunks are engraved in our minds by continuous re-use (both for first language and for second languages). So ideally, we should aim at a lexical resource (for language technology and for humans) that follows this chunk-oriented design.

    More technically, Horst, could you suggest references for “hierarchical learning systems of linear sequences and lists”? I would like to know more about them. Thanks in advance.

    Kaisa Azriouli • Hey Marina, I am sure that anybody who has been involved with issues about memory + language knows from one’s sole experience that contextual information gathered together in learning consists of more than just separate words floating apart in the memory and so comprising of the so called chunks or relevant interconnections, or collocational crossroads area from purely synoptic processes, no doubt in that.

    On the other hand the question of having more than one language going through these constant restructuring processes causes a sort of neural bombardment on the one whole of matters as contextual information is on its way to the correlation one.

    I guess that for the machine to sort and gather linguistic information it is the only way to achieve the what there is to be had maximally of the fluidity and flexibility of the natural language development. The greater the amount of each chunk in words the closer to matching the contextual words of the parallels in another language; just like in natural life when memory fallacies are avoided from the matter at hand.

    The remaining part that can’t be statistical yet stays always for the human to elaborate according to the transformational lines within the grammar of each language.

    Just for the curiosity’s sake I would like to tell you on this occasion that my learning to read my mother tongue took place all by itself on purely visual signs that I presume were processed in my head on some sort of gestalt understanding of letters in interaction. The respective ‘chunks’ in relation to the spoken language were thus in the beginning in minimum register, so to speak, but yes I, too, think that the process could be described in having the ‘chunks’ for neural places in the memory.

    Isn’t it true, by the way, that some research has shown better learning results for those taught with the method of sequential learning, which is another way that comes into play later on, when learning becomes partly creative and only then meets its limits against the hierarchical and position-based treatment of language?

    Horst Bogatz • Hello Kaisa and Marina,

    I would suggest reading “How to Create a Mind: The Secret of Human Thought Revealed” by Ray Kurzweil. You’ll find many answers to your questions there. It’s an eBook at Amazon.

    Hope it helps.


  4. Marina Santini
    15 April, 2013 at 14:06

    Suggestions from the Information Access and Search Professionals LinkedIn Group

    John Tait • Marina, I understand your problem, but are you sure we can deal with the issue in a monolingual context right now? As described it seems you are really dealing with problems of language change and of multi-word expressions in languages which use compounding in text.

    Marina Santini • Hi John,
    a timely update of lexical resources for single words and multiwords is both a monolingual and a multilingual problem. I was wondering how this problem is addressed by state-of-the-art cross-lingual IR… When coming across new words, what’s the current thinking: ignore? translate manually? or ?

    Stephen E. Arnold • Cross lingual work can be challenging. In my experience, the system with the biggest computer and knowledge base does a better job than smaller scale implementations with some of the commercially available products. I wish I could recommend a commercial solution as a snap in which works first time every time. Like most of the functions required for real world information access, the marketing hyperbole is more robust than the actual systems. For our non public work, we use a number of mechanisms. None is without constraints. Keep the group informed of your journey of exploration through the world of near real time, inter lingua, semantic technology. If you crack the code on slang for some Near Eastern and Far Eastern languages, that will be a major step forward. Stephen E Arnold, April 15, 2013

    Stefan De Bruijn • I agree with Stephen. One of the factors here is that most CLIR tasks are based on parallel corpora, and the one that is most frequently used for this (including by us) is the EuroParl corpus. While this gives a nice starting point, it is also quite a formal use of language… which is not what people use on a day-to-day basis. To overcome this, we use statistics, lots and lots of statistics on lots and lots of data… As a consequence, it suddenly all boils down to having as much processing power as you can get your hands on. That said, there are different granularities of CLIR; for example, you can use probabilistic dictionaries to retrieve texts in other languages, and that’s already a lot. If you’re interested, please drop me a note, since we’ll be launching a (commercial) search engine within a few months from now and we have these technologies available.

    Stephen E. Arnold • Just a thought. Check out Stephen E Arnold, April 16, 201


  5. 15 April, 2013 at 16:46

    Hi Marina,

    That’s one of our specialties at Pythagoria: multilingual text alignment. Really important stuff for performing cross-language semantic search. We’ve already prepared all the languages of the European Union and, recently, Romansch for the Swiss Government (Ro to FR, Ro to GE, Ro to IT, Ro to EN, … + vice versa).

    We had to build our own memory-based translation materials for the reason you rightly pointed out: the poor coverage of the lexical resources! I would add “poor or outdated”.

    Kind regards,

    David FREDRICH

  6. Marina Santini
    16 April, 2013 at 07:32

    Suggestions from the Natural Language Processing People LinkedIn Group

    Ehsan Khoddammohammadi • You may find Uby interesting. It aggregates different lexicons in English and German, which contain the sense alignments of lexical entries in the two languages.

    here is the link:

    Simon Hughes • This is a bit out there but here goes. You could train LSA on a matrix composed of sentence pairs. The columns would be the tf-idf scores of both the English and the foreign-language words (or whatever two languages you have). LSA learns word associations, so it would simultaneously learn associations between English words and English words, and between English words and French words. You could then use such a matrix to compare the similarity of English sentences to foreign sentences. I am not sure if anyone has ever tried this (see disclaimer once more!).
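
The idea sketched in this comment can be tried out on a toy scale. The following is a minimal sketch, assuming NumPy is available, not a tested recipe: the four English/French sentence pairs are invented, the smoothed idf formula and the choice of three latent dimensions are assumptions, and a real experiment would need a large parallel corpus.

```python
import numpy as np

# Invented toy bitext: each row of the matrix is one English/French sentence pair.
pairs = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
    ("the dog eats", "le chien mange"),
]


def tokens(sentence, lang):
    # Prefix tokens with the language so the two vocabularies stay separate.
    return [f"{lang}:{w}" for w in sentence.lower().split()]


docs = [tokens(en, "en") + tokens(fr, "fr") for en, fr in pairs]
vocab = sorted({t for d in docs for t in d})
col = {t: j for j, t in enumerate(vocab)}


def counts(toks):
    v = np.zeros(len(vocab))
    for t in toks:
        if t in col:
            v[col[t]] += 1.0
    return v


tf = np.vstack([counts(d) for d in docs])          # pairs x terms
df = (tf > 0).sum(axis=0)                          # document frequency per term
idf = np.log((1 + len(docs)) / (1 + df)) + 1.0     # smoothed idf (an assumption)
X = tf * idf                                       # tf-idf matrix

# LSA: keep the top-k right singular vectors as the latent term space.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
V_k = Vt[:k].T                                     # terms x k


def embed(sentence, lang):
    """Project a monolingual sentence into the shared latent space."""
    return (counts(tokens(sentence, lang)) * idf) @ V_k


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_match = cosine(embed("the cat sleeps", "en"), embed("le chat dort", "fr"))
sim_mismatch = cosine(embed("the cat sleeps", "en"), embed("le chien mange", "fr"))
print(sim_match, sim_mismatch)
```

On this toy data the English sentence ends up closer to its own translation than to the unrelated French sentence, because words that always co-occur in the same sentence pairs receive the same latent coordinates. Words that never appear in the training pairs are simply ignored, which is exactly the coverage problem the post is about.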

    John Kovarik • For what it’s worth, several years ago, I found you can connect foreign wordnets to English Wordnet automatically as long as you have a good bilingual dictionary using the strategy that the shortest path connects semantically equivalent bilingual pairs. I connected the Chinese nodes of the Chinese thesaurus 同义词词林 to the English nodes of Princeton’s Wordnet. For my notes and technical paper see:

    Zhendong Dong • Please go and look at the HowNet website. HowNet is a typical cross-lingual lexical knowledge base; at present it covers English and Chinese (simplified and traditional). It contains both single words and MWEs (110,000 for English and 105,000 for Chinese). The entries are not defined in natural language, but in a formal language for the concepts behind the entries. For example:
    doctor 1: {human|人:HostOf={Occupation|职位},domain={medical|医},{doctor|医治:agent={~}}}
    doctor 2: {human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}}
    The first definition means: a doctor is a person who belongs to the medical domain, holds an occupation, and acts as the agent of treating.
    A mini-Hownet and some other apps are shown on the website and you can download and have a try.
    HowNet attempts to reveal and compute the content behind lexical forms. That is why HowNet can measure the similarity between lexical items in different languages.

    Stoney Vintson • 1. The UWN ( Universal Multilingual Wordnet ) has a Creative Commons 3.0 license which presents a problem for commercial use. There is a paper describing how they realized their multi lingual wordnet. UWN is based on the Princeton English language wordnet project.

    UWN / MENTA site

    Towards a Universal Wordnet by Learning from Combined Evidence

    Web Query Expansion by WordNet ( This paper is cited by the UWN paper )

    2. Global Wordnet

    3. Does alignment and extraction of phrases from parallel corpora help? What synset information do you need? { synonymy, hypernymy / hyponymy, antonymy, meronymy / holonymy, gradation, entailment, troponymy } Extracting the synset seems to be the most difficult task.

    Michael Collins’s Coursera NLP class demonstrates how a linguistically naive IBM Model 2 can be used to extract alignment probabilities from parallel corpora / bitext. This can be used to extract MWE / phrase probabilities for use in phrase-based statistical machine translation. One of the techniques used to improve the alignment data is to do the alignment in both directions to create an alignment matrix: p( swedish | english ) and p( english | swedish ).

    Giza++ ( IBM Models 1-5 )
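
To make the two-direction idea concrete, here is a minimal sketch in plain Python. It uses IBM Model 1 rather than the Model 2 mentioned above (Model 1 has no position parameters and is the simplest member of the family), it omits the NULL word, and it combines the two directions by plain intersection, which is only one of several symmetrization heuristics. The three-sentence Swedish/English bitext is invented for the example; it is not GIZA++ and not the course's code.

```python
from collections import defaultdict


def train_model1(bitext, iterations=20):
    """EM estimate of t[(tgt_word, src_word)] = p(tgt_word | src_word)."""
    tgt_vocab = {w for _, tgt in bitext for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in bitext:
            for w_t in tgt:
                # E-step: share each target word among the source words.
                norm = sum(t[(w_t, w_s)] for w_s in src)
                for w_s in src:
                    frac = t[(w_t, w_s)] / norm
                    count[(w_t, w_s)] += frac
                    total[w_s] += frac
        # M-step: renormalise the expected counts per source word.
        t = defaultdict(float, {p: c / total[p[1]] for p, c in count.items()})
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    return {(max(src, key=lambda w_s: t[(w_t, w_s)]), w_t) for w_t in tgt}


# Invented toy bitext (Swedish, English).
sv_en = [
    ("egen remiss".split(), "own referral".split()),
    ("egen bil".split(), "own car".split()),
    ("en remiss".split(), "a referral".split()),
]

p_en_given_sv = train_model1(sv_en)
p_sv_given_en = train_model1([(en, sv) for sv, en in sv_en])

sv, en = sv_en[0]
forward = align(sv, en, p_en_given_sv)                       # (sv, en) links
backward = {(s, e) for e, s in align(en, sv, p_sv_given_en)}  # flipped to (sv, en)
symmetric = forward & backward
print(sorted(symmetric))
```

Links that survive the intersection are the ones both directions agree on. In a phrase-based pipeline such links are the starting point for extracting multiword units.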

    Marina Santini • Thanks, Stoney, for your detailed answer. We are currently thinking of implementing a graph model (maybe a probabilistic graph model) and all the Wordnets seem to be a good starting point (regardless of their limitations, and especially the paucity of multiword expressions).

    The pointer to Michael Collins is very much to the point, since one of my main concerns is:
    1) how to go beyond the crystallization of language in “concepts”; there is quite a wide range of lexical expressions that do not fit well into single “concepts”. I am thinking of lexical bundles, trigger phrases, light verbs (which are always multiword expressions)…

    My other big concern is:
    2) how to dynamically identify new expressions and then update a multilingual resource with new chunks of language that emerge and become stabilized very quickly?


    Simone Paolo Ponzetto • Hi Marina, you might find BabelNet interesting for your work:

    Yashar Mehdad • Hi Marina,

    I mainly used MultiWordNet, dictionaries and paraphrase tables in my cross-lingual semantic work. In this ACL paper I compare them in a cross-lingual textual entailment framework:

    You might also find my recent PhD thesis on Cross-Lingual Textual Entailment interesting:

    I also followed the work by Simone and found it interesting but since I was over with my PhD thesis I could not try his resource for my work.


  7. Marina Santini
    16 April, 2013 at 07:37

    Suggestions from the LEX LinkedIn Group

    Suzanne Franks • The following website does a good job with English/Portuguese and Portuguese/English. Since I am not fluent in the other languages represented at this site, I can’t comment on how well they perform. As most lexicography today is based on corpora, this site collects bilingual parallel corpora from professional translation websites.

  8. 16 April, 2013 at 07:44

    From Genres on the Web Facebook page

    Kaisa Azriouli I was also thinking while writing on the subject that maybe I’d better write on Linkedin, but as it first caught my eyes here and the subject is under my interests and overlapping some resultive thoughts I am processing in my linguistic accomodation process I just kept on writing right away from the subject in a conclusive manner bound to my present speculations about language acquisition on multi-dimensional frames.

  9. Ismael Arinas
    17 April, 2013 at 06:47

    Hello Marina,
    In this paper they describe the knowledge base FunGramKB which is designed with a linguistic model as foundation and its purpose is to support NLP applications. It will not extract terms fully automatically and in its ontology you have to introduce the new terms, but once introduced, the interlinguistic linking is very easy.

  10. 17 April, 2013 at 09:31

    Thanks Ismael!

  11. 18 April, 2013 at 09:00

    Suggestions from the Forensic Linguistic Evidence LinkedIn Group

    Carole Chaski, PhD • Mark Davies has announced a new corpus, all scraped from the web (and as far as I can tell, not vetted for authorship) that has over 1.9 billion words from 20 countries (all ENGLISH –not multilingual). It can be used for dialect and language change so I think it could be a source; check it out at

    CLIR (cross-linguistic information retrieval) has been ongoing for several years and relies heavily on machine translation dictionaries. Are you thinking of that kind of lexical resource?

    Marina Santini • Hi Carole,
    I am thinking of a cross-lingual lexical resource that can be used for NLP and by humans. After all, NLP is also made for humans 🙂

    Carole Chaski, PhD • I think you can get hold of CLIR dictionaries in human-readable form. After all, humans create these things to be implemented by machine, not the other way round (yet lol). I suggested the CLIR work because it has already been done, and you wouldn’t have to start from scratch. On the other hand, sometimes the data you want is proprietary or in a format that is so difficult to use you just start all over again anyway.

    Another idea is going back to Joseph Greenberg’s work. He did a lot of work in linguistic typology and I think he made cross-lingual word-lists. And SIL may have something along these lines, especially for the under-represented languages (since SIL works in indigenous linguistic analysis for literacy and Bible translation). Just some ideas.


  12. 20 April, 2013 at 17:02


    Here’s another related tool: Linguos, a cross-language phonetic transcription engine. It is focused on non-Roman scripts/languages and covers virtually every major script. The system can take input in English or any of these non-Roman languages and provide output in any desired language.

    Linguos applies language specific algorithms to generate phonetic matches and uses these transcriptions to search against target indexes. It also has a large internal corpus to pre-create search terms when searching against external search engines.

    A couple of pages to try:

    Hope this helps in your efforts.


  13. 1 May, 2013 at 08:04

    Thanks for the references, Venkat!

  14. 1 May, 2013 at 08:07

    From Applied Linguistics LinkedIn Group

    Katerina Xafis • Extremely interesting and a problem others have discovered too! Your suggestion sounds good so that cross-lingual single and multi-word entries can be accessed, especially new ones that have not yet been officially recorded anywhere.
    Katerina Xafis • I think the problem runs deeper because newly coined lexical items are not easy to find in monolingual resources, let alone cross-lingually. Perhaps monolingual lexical items need to be systematically covered under single and multi-word entries before a cross-lingual base is possible. I’m saying this because what is multi-lexical in one language may well be a single word in another. You seem to have discovered a hot spot requiring attention in lexical resources, monolingually and cross-lingually.
    William Charlton • Marina, as Katerina says, it’s hard enough to track all this in one language, let alone in multiple languages. Many translators have personal and sometimes public “memories”. These would seem to me to be the obvious starting point for a conduit to collect a global multi-lexical database of language-specific jargon, neologisms and the like. It would need a BIG system to collate and manage all this.
    It is a classic stumbling block for translators when the source text contains some phrase, shorthand, common (in the source language) vernacular or abbreviation that has no bilingual dictionary entry. I ran into this when translating a piece from Italian in which the author had written a parenthesised comment followed by n.d.a. (Non-Disclosure Agreement, maybe?). It stands for Nota dell’Autore; finding this out was not easy, but it was obvious once I had.
    “How?” is the big question, because what is needed is Google/Bing/machine translation on steroids.
    I agree it’s needed and will become increasingly so, I suspect.
    Marina Santini • Thanks for your feedback and support, Katerina and William!

    Katerina Xafis • Marina, you made me look up an expression (early bird) to see if it is in online monolingual dictionaries in the sense of early-bird discounts, prizes, etc. So far, I have not found it in this context in the main dictionaries. You have to google the specific phrase to find its definition (e.g. the whole phrase ‘early bird discount’).
    Mária Adorján • I think creating a global multi-lexical database is only feasible if we break the task down into several domains and sub-fields. E.g. we can think of a multilingual database of business; within this, legal; within this, contracts, etc. Data may be collected from local universities and the project coordinated by big publishing houses, financed by the EU? The UN? New forms of cooperation will definitely be necessary to organise the existing knowledge.
    Marina Santini • Katerina, I could not find record of “early bird registration” (which is very common in the academic world when it comes to conference organization and participation). Reverso says:

    Maria, your suggestion is very much in line with my wishful thinking. We just have to wait for the right EU call… I wonder what Horizon 2020 has in store for us 🙂

    Manuel Faelnar • Yes. Examples from Cebuano, a new language in Google Translate and Google Search:

    1. The single English word “wash” is multilexical: (a) “wash face” is “hilam-os”,
    (b) “wash hands” is “hinaw” or “hunaw”, (c) “wash feet” is “himatiis”, (d) “wash limbs” is “himasa”, (e) “wash up” is “hugas”, (f) “wash clothes” is “laba” (from Spanish “lavar”).

    2. The English word “problem” is bilexical in Cebuano: (a) “problem” (emotional) is
    “suliran”, (b) “problem” (in school exams, mathematics, science, puzzles, etc) is


    Martin Benjamin • Marina, the problem you describe is comprehensively treated in the Kamusi platform. Registered users can submit terms for review, be they single words or multi-word expressions, in any language that is configured for the system (40+ and counting; if you want to add yours, please contact me). Each entry (unique combination of a spelling and a meaning) requires a definition in its own language, and can then be linked to similar concepts, either single words or MWEs, in any other language.

    There’s a lot more to it than that, and a lot of programming for enhanced features that we are still working on, but the system is built to accomplish exactly what you describe, and it has been working since February. Now we need to build up the data, so please register and try adding some of the terms you wish were easily available in an online dictionary.

    (In English, we are calling MWEs “noun phrase”, “verb phrase”, etc., as parts of speech. Those may not be the terms that linguists would most prefer, but we need some terms that can be easily understood by the general public; better suggestions appreciated.)
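
The entry model described in the comment above (an entry is a unique combination of a spelling and a meaning, needs a definition in its own language, and can be linked to single words or MWEs in other languages) might be sketched roughly as follows. This is only an illustration under stated assumptions, not Kamusi's actual schema or API: the names `Entry`, `Lexicon`, `add`, `link` and `equivalents` are hypothetical, the definitions are placeholders, and the Cebuano pair is taken from the "wash" example earlier in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One spelling paired with one meaning (hypothetical model)."""
    spelling: str         # single word or multi-word expression
    language: str         # language code, e.g. "en", "ceb"
    part_of_speech: str   # e.g. "verb", "verb phrase"
    definition: str       # must be written in the entry's own language


class Lexicon:
    """Stores entries and symmetric cross-lingual links between them."""

    def __init__(self):
        self._links = {}  # Entry -> set of linked Entry objects

    def add(self, entry):
        # Mirror the rule that every entry requires a definition.
        if not entry.definition.strip():
            raise ValueError("an entry needs a definition in its own language")
        self._links.setdefault(entry, set())
        return entry

    def link(self, a, b):
        # Both sides must already exist as entries before they are linked.
        if a not in self._links or b not in self._links:
            raise KeyError("both entries must be added before linking")
        self._links[a].add(b)
        self._links[b].add(a)

    def equivalents(self, entry, language):
        # Spellings of linked entries in the requested language, sorted.
        return sorted(e.spelling for e in self._links.get(entry, ())
                      if e.language == language)


lexicon = Lexicon()
# Definitions below are illustrative placeholders, not dictionary data.
wash_face = lexicon.add(
    Entry("wash face", "en", "verb phrase", "to clean one's face with water"))
hilam_os = lexicon.add(
    Entry("hilam-os", "ceb", "verb", "(placeholder for a Cebuano definition)"))
lexicon.link(wash_face, hilam_os)

print(lexicon.equivalents(wash_face, "ceb"))
```

Because the links are stored between whole entries rather than between bare strings, an MWE in one language can point to a single word in another, which is the asymmetry several commenters raise.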

  15. 5 May, 2013 at 10:12

    From KD2U – Knowledge Discovery in Distributed and Ubiquitous… (LinkedIn Group)

    Ina Lauth • Check the “CLEF Campaign” and TREC for these kinds of samples. I think that under the CLARIN infrastructure projects there are also different websites for different languages (e.g. CLARIN-D for German), where you can find all the language resources you need. The CLARIN project was founded for just this purpose: to connect all the centers with all the language resources available.
