Summary: Looking for Corpora…

Dear All, in this post I collect all the suggestions I received for the following request: “Looking for Corpora in….” Big thanks to (hope I have not forgotten anybody): Johannes Heinecke, Dominika Rogozinska, Mohamed-Zakaria KURDI, Bartosz Ziólko, Olga Whelan, Margarita Borreguero, Zuloaga, Ayesha Zafar, Will Snellen, Katherine (Katie) Skees Hund, Anna Matyszczyk, Massinissa Ahmim, Marcin Feder, Maria Pia Montoro, Lawrence Niculescu, Jesus Vilares, Ewa Gwiazdecka, Jack Bowers, Taner Sezer, Yvonne Adesam, Kadri Muischnek, Anne Tamm, Ralf Steinberger, Ricardo Campos, Edyta Jurkiewicz-Rohrbacher, Pat, Sara Castagnoli, Adam Przepiorkowski, Hung Le Khanh, Kristian Kankainen, Norton Roman, Mansur Sayhunov. Suggestions were sent through: Mailing Lists: Corpora List, BCS IRSG; LinkedIn Groups: Corpus Linguistics, Computational Linguistics, Natural Language Processing, Applied linguistics, Terminology Services. I hope this list of corpora is useful for everybody working with multi- and cross-linguality. Please …

Cloud & Big Data Day

On 24th Sept 2013, I attended the Cloud & Big Data Day in Stockholm (Kista), organized by SICS and EIT ICT Labs. Cloud & Big Data Day is part of SICS Software Week, which takes place every year. The specific purpose of the Cloud & Big Data Day was to “feature leading international and Swedish experts from industry and academia, who present the cutting edge of cloud computing technologies. The intended audience is professionals in IT and its applications for all areas in industry and academia”. The presentations were all interesting and covered a wide range of projects and applications centered on big data: from how to harness petabytes of data at Spotify, to big cellular network data; from Hop (Hadoop Open Platform-as-a-Service) to ConPaaS (Platform as a Service for Multi-clouds), …

Summary: Multi-dimensional Social Network Datasets

Last Updated: 8 Oct 2012
Here is a summary of the suggestions received so far in response to the request for multi-dimensional social network datasets/corpora/collections (read the request here). Please do not hesitate to contact me with further suggestions.
Suggestions
Datasets:
* Twitter social graph and celebrity graph dataset.
* Facebook social graph, weighted random walks dataset and lastfm multigraph dataset. Also check out the software and publications at the same link.
* Facebook 100 million dataset. The data was shared on torrent sites (it might not be available anymore).
* Check out sites like infochimps to see if you can find a relevant dataset there.
Tool:
* You can pick up on co-authorship relationships in the academic world using Elsevier’s data: they have products that do the analysis for you, called SciVal.
Websites that can be used to create multi-dimensional social datasets:
* ArnetMiner …

Automatic Language Analysis for Suicide Prevention

Text/Content Analytics for Suicide Prevention (II)
Last Updated: 2nd October 2012
Last week I sent out a request about the analysis of suicides’ language to several LinkedIn groups, asking for pointers to previous studies and existing material that could enrich the list of references proposed as a starting point (see here). Noteworthy suggestions and useful reflections are summarized below:
• The work of James Pennebaker in Texas. His group has its own corpus-like tool (LIWC), which has been used for this purpose too. He did an analysis of the writings of Sylvia Plath and other poets (several of whom met tragic ends at their own hands) and had some very interesting findings about their use of pronouns in particular.
• References: Pennebaker did a study on the language of suicidal poets, and also on depressed and depression-vulnerable college students (among …

Summary: Where is the future? From big data to contextualized information

Comments on the post: The Path Forward: From Big Unstructured Data to Contextualized Information
Discussion on LinkedIn: American Society for Information Science & Technology
Tom Reamy • Hi Marina, good blog – and as someone who has been dealing with the idea of context in text analytics for many years, I’m in total agreement as to its importance. There are quite a few other types of context that are important as well. Another conversation. As far as text analytics tools dealing with this, most of them can, but the ones with a full set of operators will probably do best. Two contextual areas come to mind immediately: how to get TA software to recognize context like genre when it is not specified, and how to take context into account in categorization or extraction rules. The …

