Summary: Looking for Corpora…

Dear All,

In this post I collect all the suggestions I got for the following request: “Looking for Corpora in….”

Big thanks to (I hope I have not forgotten anybody): Johannes Heinecke, Dominika Rogozinska, Mohamed-Zakaria KURDI, Bartosz Ziólko, Olga Whelan, Margarita Borreguero Zuloaga, Ayesha Zafar, Will Snellen, Katherine (Katie) Skees Hund, Anna Matyszczyk, Massinissa Ahmim, Marcin Feder, Maria Pia Montoro, Lawrence Niculescu, Jesus Vilares, Ewa Gwiazdecka, Jack Bowers, Taner Sezer, Yvonne Adesam, Kadri Muischnek, Anne Tamm, Ralf Steinberger, Ricardo Campos, Edyta Jurkiewicz-Rohrbacher, Pat, Sara Castagnoli, Adam Przepiorkowski, Hung Le Khanh, Kristian Kankainen, Norton Roman, Mansur Sayhunov.

Suggestions were sent through:
Mailing Lists: Corpora List, BCS IRSG
LinkedIn Groups: Corpus Linguistics, Computational Linguistics, Natural Language Processing, Applied linguistics, Terminology Services.

Hope this list of corpora is useful for everybody working with multi- and cross-linguality. Please do not hesitate to contact me if you wish to contribute and add more corpora/resources to this list.

Please let me know if I have inadvertently disregarded your pointers.
Cheers, Marina

— start-of-the-list — Last Updated 13 May 2014


  • Multi-lingual corpora & resources

If you can live with some noise, why not use Wikipedia? You can download a dump of the entire Wikipedia for each language you need (without old revisions and images), e.g. for Estonian (replace et in etwiki with the ISO language code of each other language you need). Once you have the dump (an XML file), you can extract the page tags, each of which contains a wiki page in MediaWiki syntax; several toolkits exist for parsing that syntax into plain text. In my experience, very short pages are often stubs, so they do not contribute “good” text to your corpus, and very long pages are often listings, so they too may not be adequate for your task.
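As a rough sketch of that workflow (all names here are hypothetical; a real dump is far larger and XML-namespaced, so you would use `ET.iterparse` and a proper MediaWiki parser rather than these crude regexes):

```python
import re
import xml.etree.ElementTree as ET

def extract_pages(dump_xml, min_chars=200):
    """Yield (title, plain_text) for each <page> in a dump-like XML string.

    Pages whose stripped text is very short are skipped, since those are
    usually stubs that contribute no "good" text to a corpus.
    """
    root = ET.fromstring(dump_xml)  # for real dumps, stream with ET.iterparse
    for page in root.iter("page"):
        title = page.findtext("title", default="")
        raw = page.findtext(".//text", default="") or ""
        # Crude markup stripping: [[target|label]] -> label, [[target]] -> target,
        # '''bold''' / ''italic'' -> plain text. Real pipelines use a wiki parser.
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", raw)
        text = re.sub(r"'{2,}", "", text)
        if len(text) >= min_chars:
            yield title, text.strip()

# Tiny hand-made sample mimicking the dump structure (no namespaces):
long_text = "'''Tallinn''' is the capital of [[Estonia]]. " + "word " * 60
sample = f"""<mediawiki>
  <page>
    <title>Stub</title>
    <revision><text>Too short.</text></revision>
  </page>
  <page>
    <title>Tallinn</title>
    <revision><text>{long_text}</text></revision>
  </page>
</mediawiki>"""

pages = list(extract_pages(sample))
print(pages[0][0])  # the stub is filtered out; only "Tallinn" survives
```

The stub filter is just a character-count threshold; tune `min_chars` (and add a filter for list-like pages) to the noise level your task can tolerate.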
texts published by the European Parliament
The Gutenberg project contains many books in several languages.
You might get some other ideas for finding texts by checking OPUS (parallel corpora):
Massive multi-lingual corpora:
CLEF: it may be a good starting point.
The NLTK (Python) toolkit includes corpora in several of the languages you are looking for (and several others), in a variety of genres. Languages that are definitely included are English, Italian and Polish; there may also be Swedish and Hungarian, but I’m not 100% sure….
You can also get the corpora without the rest of the NLTK tools:
At the JRC’s Language Technology page, you will find parallel corpora for all the languages you are searching for, and more.
Semantics-based multilingual dictionary:
Google’s Cross-Lingual Dictionary (GCLD), released in 2012: 175M short, unique text strings that were used to refer to one of 7.6M Wikipedia articles. The details of the data and how it was constructed are in an LREC 2012 paper by Valentin Spitkovsky and Angel Chang, “A Cross-Lingual Dictionary for English Wikipedia Concepts”. Get the data here:

Multi-Wordnet:

Multilingual corpora listed by Isabella Chiari:

  •  Estonian

1) You can access the Estonian corpora more comfortably via pages that also have English versions. Most of these corpora also have a morphologically analysed version (lemma, POS, grammatical categories), available on request and free for non-commercial purposes.
2) Estonian:

There is also a quite comprehensive list of all sorts of resources for Estonian (wordlists, biographical data collections, dialect data, phonetic resources, spoken language, internet language, learner language corpora, etc.) here:
All descriptions are currently in Estonian only.

  • Finnish

You can get access to quite decent corpora of Finnish from the Language Bank of Finland. For that, however, you would need to register (which is pretty simple); link here:
Other options are:
– the corpus of the Institute for the Languages of Finland, which also contains some older texts

– Project Gutenberg.

  • Hungarian


  • Italian
  1. BADIP: BAnca Dati dell’Italiano Parlato
  2. CLIPS: Corpora e Lessici dell’Italiano Parlato e Scritto
  3. CoLFIS: Corpus e Lessico di Frequenza dell’Italiano Scritto
  4. CORIS/CODIS: a corpus of written Italian, available on-line for research purposes
  5. PAISÀ: Piattaforma per l’Apprendimento dell’Italiano Su corpora Annotati
  6. Italian Corpora listed by Isabella Chiari:
  7. Corpora and other linguistic resources in Italian:
  8. You can download this MS Word file for a list of Italian Corpora and Lexica
  9. Interesting corpus tool showing syntactic frames and slots:
  • Polish

1) You can find a comprehensive list of Polish corpora here:

2) The main corpus of Polish is the National Corpus of Polish, with two search engines:


3) See also:

  • Portuguese Brazilian

Two corpora of human-produced dialogue summaries. The site will soon also present a set of tools for corpus segmentation and annotation (specifically tailored to dialogue summary annotation), along with publications and technical documents relating to the corpora:

  • Romanian

Here are a few links to Romanian corpora:

  • Spanish

• Various corpora compiled by Real Academia Española, all containing both European and American texts:

  • Swedish

Probably the biggest collection of corpora for Swedish is Språkbanken. It’s free and it has concordance searching.

Some of the corpora can even be downloaded. Some of them are restricted, but a lot of the corpora are free. Most of the others are available as ‘sentence sets’, if a sentence is enough context for you (the texts are scrambled so that each sentence is intact, but the entire text is not). Contact Språkbanken if you would like some help picking out corpora!

  • Tatar


  • Turkish

Taner Sezer says: I’m working on building Turkish corpora, and I have built four so far.
Respectively, they are composed of online newspapers and various other websites (+491M tokens), the Turkish Wikipedia (+47M tokens), a tweet corpus (2009-2011) of 1 million tweets (+13M tokens) and a small corpus of online newspapers (~500 thousand tokens).
Each corpus is POS-tagged and also includes morphological tagging.
The small newspaper corpus has metadata for year, genre and source, and the tweet corpus has a new tag set covering internet abbreviations, smileys, internet emphasis (words in which a character is repeated to emphasize the word in context) and misspelled words.
Turkish is one of the most widely spoken languages; nevertheless, because of missing tools and software and the lack of usable corpora for linguistic studies, Turkish takes a backseat in computational linguistics.
To sum up, I’ll be very pleased if I can provide data for your study and help with Turkish as a native speaker.
Best Regards, Taner Sezer

  • Urdu

• For the Urdu language you may get corpora from


• This is a free tool for text analysis that might help you with the social media aspects of your research:

• Ricardo Campos says: you can find some interesting datasets on my webpage (under the “datasets” tab):

— end-of-the-list —

5 comments for “Summary: Looking for Corpora…”

  1. 11 May, 2014 at 09:10

    You are right, Riyaz 🙁

  2. 12 May, 2014 at 08:52

    Hello, Marina!

    I see Turkish and Urdu languages in this list. Just in case, maybe you are also interested in Tatar language corpus:

    Best regards…

  3. 12 May, 2014 at 09:00

    Thank you for the Tartar corpus page. What is your name? I wish to add your name to the list of contributors (see blog post above)

  4. 12 May, 2014 at 09:10

    Hello again. That’s my name 🙂
    Nowadays this language is usually called not “TaRtar” but “Tatar” 😉

  5. 12 May, 2014 at 09:13

    Ok, sorry 😉
