Looking for Corpora to explore Cross-Linguality

Dear All,

I am looking for corpora of any genre in the following languages: English, Swedish, Polish, Italian, Finnish, Estonian, and Hungarian.

I am already aware of a number of corpora (several posts in this blog are dedicated to the dissemination of corpora-related information). These corpora are mostly in English. I would now like to focus on: 1) additional languages and 2) additional genres, such as search query logs, TV scripts, emails, tweets, WhatsApp messages, etc. All genres are welcome! The only requirement is that the corpora must be free and publicly available, so that everybody can replicate or extend experiments using the same corpora/datasets.

The purpose of the experiments is to explore cross-linguality in different settings. Please read the use cases below to get an idea of the type of communicative situations we would like to explore.

Thanks in advance for your suggestions and pointers. Marina

Use Case 1. Information Access

In multi-ethnic societies, many non-native speakers use public websites to access information that is vital to their lives and to their integration in a new country. National regulations are often accompanied by special terminology and new coinages. For instance, the Swedish expression egenremiss denotes a referral to a specialist doctor written by patients themselves. This expression is made up of two common Swedish words, egen ‘own (adj)’ and remiss ‘referral’. It is a recent expression (probably coined around 2010) and is not yet recorded in any official dictionary, in Wiktionary, or in other multilingual online lexical resources. However, it is a very frequent search query on Swedish public health websites and in Google.se searches. A preliminary investigation showed that none of the existing multilingual lexical resources contains this expression. Furthermore, Google Translate (the de facto baseline resource for many cross-lingual experiments) proposed the translation “private referral”, which might be incorrect: it is unclear whether it is a good equivalent of “egenremiss”, which would presumably be more appropriately translated as “self-referral”.

One solution would be to start storing emerging lexical forms harvested from large web corpora and from other underutilized sources of lexical knowledge, such as query logs, and to validate them with the help of domain experts and professional lexicographers.
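As a rough illustration of the harvesting step, here is a minimal Python sketch that counts the tokens in a query log and keeps the frequent ones that are missing from a known lexicon. The function name, the threshold, and the toy data are all hypothetical and invented for this example; a real pipeline would need proper tokenisation and spelling normalisation, and the candidates would still have to be validated by experts.

```python
from collections import Counter


def candidate_new_forms(queries, known_lexicon, min_count=2):
    """Return (token, count) pairs for frequent query tokens missing from the lexicon."""
    # Count lower-cased, whitespace-separated tokens over all queries.
    counts = Counter(
        token
        for query in queries
        for token in query.lower().split()
    )
    # Keep only tokens that are frequent enough and not already recorded.
    return [
        (token, count)
        for token, count in counts.most_common()
        if count >= min_count and token not in known_lexicon
    ]


# Toy data, invented for illustration only (not taken from a real query log).
queries = ["egenremiss blankett", "remiss", "egenremiss", "Egenremiss ortoped"]
known_lexicon = {"egen", "remiss", "blankett", "ortoped"}

candidates = candidate_new_forms(queries, known_lexicon)
print(candidates)
```

The output is only a list of candidates to hand over to domain experts and lexicographers; frequency alone does not tell a new coinage from a misspelling.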

Use Case 2. Lexicon users’ needs

Expressions that are marked for style, genre, domain, or register (and/or other textual categories), or that are misspelled or idiomatic for some textual category, are often beyond the competence of a novice reader or a non-native speaker. This is especially true in social networks, since such readers cannot tell whether the texts they read are good or bad the way native readers can. When readers read a language they do not know at all, they can use automatic translation, online dictionaries, or other lexical resources. However, they cannot determine the type of text they are reading. They cannot tell if the text is verbose, terse, formal, informal, stupid, funny, bad, or good. For instance, the Italian phrase Ciao bella [literally: ‘hello beautiful (woman/girl)’] is a salutation that in everyday speech is not linked to the concept of being a ‘beautiful woman/girl’. It is an informal (familiar or colloquial) phrase that expresses some kind of affection or familiar relation to a woman. The masculine *Ciao bello is not used. If the multiword expression Ciao bella is found in a text, it indicates that the stylistic register is or becomes familiar and more intimate. The suggested translation should then not be something like “Hi beautiful” in English or “Hej vacker” in Swedish, but rather “Hello dear” or “Hello love” in English and simply “Hej på dig” in Swedish. To give another example, the phrase have a buzz, often used in the scripts of Coronation Street (a famous British soap), signals that the line is spoken by a Mancunian, and those who do not live or have friends in Manchester might miss that it means “have a good time”.

Users would benefit from context-of-use labels telling how lexical forms contextualise across registers, styles, domains, or genres, on a micro-level as well as on a text level.
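To make the idea of context-of-use labels a little more concrete, here is a small Python sketch of what a labelled lexicon entry could look like. The class, its field names, and the label keys are assumptions made up for this illustration, not an existing resource or standard; the two entries simply reuse the examples from Use Case 2.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """A lexical form with a gloss and free-form context-of-use labels (hypothetical schema)."""
    form: str
    language: str
    gloss: str
    labels: dict = field(default_factory=dict)


# Entries based on the examples in Use Case 2; the label keys are illustrative only.
entries = [
    LexicalEntry("Ciao bella", "it", "Hello dear", {"register": "informal"}),
    LexicalEntry("have a buzz", "en", "have a good time", {"region": "Manchester"}),
]


def find_by_label(entries, key, value):
    """Return the forms whose label `key` has exactly the given value."""
    return [entry.form for entry in entries if entry.labels.get(key) == value]


print(find_by_label(entries, "register", "informal"))
```

A flat dictionary of labels is just the simplest option; a real resource would probably need a controlled vocabulary for registers, styles, domains, and genres.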

