The WebGenre Blog: The power of genre applied to digital information. By Marina Santini » Entries tagged with "corpora"

Summary: Looking for Corpora…

Dear All, in this post I collect all the suggestions I received in response to the following request: “Looking for Corpora in….” Big thanks to (hope I have not forgotten anybody): Johannes Heinecke, Dominika Rogozinska, Mohamed-Zakaria KURDI, Bartosz Ziólko, Olga Whelan, Margarita Borreguero, Zuloaga, Ayesha Zafar, Will Snellen, Katherine (Katie) Skees Hund, Anna Matyszczyk, Massinissa Ahmim, Marcin Feder, Maria Pia Montoro, Lawrence Niculescu, Jesus Vilares, Ewa Gwiazdecka, Jack Bowers, Taner Sezer, Yvonne Adesam, Kadri Muischnek, Anne Tamm, Ralf Steinberger, Ricardo Campos, Edyta Jurkiewicz-Rohrbacher, Pat, Sara Castagnoli, Adam Przepiorkowski, Hung Le Khanh, Kristian Kankainen, Norton Roman, Mansur Sayhunov. Suggestions were sent through mailing lists (Corpora List, BCS IRSG) and LinkedIn groups (Corpus Linguistics, Computational Linguistics, Natural Language Processing, Applied Linguistics, Terminology Services). I hope this list of corpora is useful for everybody working with multi- and cross-linguality. Please …

Filed under: dissemination, references, summaries

Looking for Corpora to explore Cross-Linguality

Dear All, I am looking for corpora of any genre in the following languages: English, Swedish, Polish, Italian, Finnish, Estonian, and Hungarian. I am already aware of a number of corpora (several posts in this blog are dedicated to the dissemination of corpora-related information). These corpora are mostly in English. I would now like to focus on: 1) additional languages and 2) additional genres, such as search query logs, TV scripts, emails, tweets, WhatsApp messages, etc. All genres are welcome! The only requirement is that corpora must be free and publicly available, so that everybody can replicate or extend experiments using the same corpora/datasets. The purpose of the experiments is to explore cross-linguality in different settings. Please read the use cases below to get an idea of the type of communicative situations we …

Filed under: featured, requests

Summary: Multi-dimensional Social Network Datasets

Last Updated: 8 Oct 2012

Here is a summary of the suggestions received so far in response to the request for multi-dimensional social network datasets/corpora/collections (read the request here). Please do not hesitate to contact me with further suggestions.

Suggestions

Datasets:
* Twitter social graph and celebrity graph dataset.
* Facebook social graph, weighted random walks dataset and lastfm multigraph dataset. Also check out the software and publications available from the same source.
* Facebook 100 million dataset. The data was shared on torrent sites (it might not be available anymore).
* Check out sites like Infochimps to see whether you can find a relevant dataset there.

Tool:
* You can pick up on co-authorship relationships in the academic world using Elsevier’s data: they have products that do the analysis for you, called SciVal.

Websites that can be used to create multi-dimensional social datasets:
* ArnetMiner …

Filed under: summaries

Review: Creating Corpora With Active Learning

PhD thesis reviewed by Marina Santini

Fredrik Olsson, Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora. Doctoral thesis, University of Gothenburg, 2008

Download thesis from this page:

The PhD thesis “Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora” by Fredrik Olsson contains 13 chapters and an appendix with the base learner parameter settings. The Introduction unfolds the problem and the argument, and the remaining 12 chapters describe the background (Part I, Chapters 2-5), present the BootMark method (Part II, Chapter 6), test the proposed method (Part III, Chapters 7-12) and summarize findings, experience, and viable future directions (Part IV, Chapter 13). The thesis describes BootMark, a bootstrapping method for named-entity recognition based on active learning. The …

Filed under: reviews