Lecture 7: Learning from Massive Datasets

By Marina Santini. In this lecture we explore how big datasets can be used with the Weka workbench and what other issues are currently under discussion in the real world, for example big data applications, predictive linguistic analysis, new platforms, and new programming languages.

Cloud & Big Data Day

On 24th Sept 2013, I attended the CLOUD & BIG DATA DAY in Stockholm (Kista) organized by SICS and EIT ICT Labs. Cloud & Big Data Day is part of SICS Software Week, which takes place every year. The specific purpose of the Cloud & Big Data Day was to "feature leading international and Swedish experts from industry and academia, who present the cutting edge of cloud computing technologies. The intended audience is professionals in IT and its applications for all areas in industry and academia". The presentations were all interesting and covered a wide range of projects and applications centered on BIG DATA: from how to harness petabytes of data at Spotify, to big cellular network data; from Hop (Hadoop Open Platform-as-a-Service) to ConPaaS (Platform as a Service for Multi-clouds),

Reflection: Analysing Emotions of Social Writing

by Marina Santini. A few days ago, I attended a fascinating session organized by the Quantified Self Stockholm (QS) Meetup, in a venue with an inspiring name, Psykologifabriken (The Psychology Factory), in central Stockholm. This QS session – Adding Power to body and soul… – included two presentations: one about adding power to the body through a robotic glove that adds gripping energy to the hand of those who have lost strength in this limb; the other about methods to enable self-development through digital tools. Since I am not into robotics, I will only say that the empowering glove shown by Johan Ingvast from Bioservo is simply amazing… I am not a psychologist either, but I found the presentation about empowering the "soul" very relevant to some of my interests, namely sentiment analysis, mood

Report: Language in the Digital Age – META-NORD National Workshop

by Marina Santini. Held in Stockholm, Sweden, 23 Nov 2012. Download program and presentations here. I was very happy to attend the workshop "Language in the Digital Age" last week in Stockholm. It was informative and inspiring. The workshop's venue – Stacken at Nalen's (a building from the end of the 19th century) – is a fascinating example of architectural reuse. Stacken (literally meaning "The Stack", but probably a nickname to refer to the boxing ring) was the former boxing gym of the still existing Narva Boxningsklubb. Now Stacken is a cosy conference/banquet room decorated with four thin columns that add status and elegance to events. The speakers and the audience (about 50 people) represented a wide range of interests, from the linguistic needs of the

Meetup Report: Big Data & Predictive Modeling – What’s happening in Sthlm?

On Thursday, September 6, 2012, the first meetup on BIG DATA & PREDICTIVE MODELING - WHAT'S HAPPENING IN STHLM? was held at the Klarna Headquarters in Stockholm. The event was very successful and (according to the organizer) unexpectedly crowded, with about 90 attendees: passionate practitioners and, more generally, people interested in big data (like myself). Although I could not attend the socializing slots before and, above all, after the event at the bar, it was a very informative and enjoyable meeting, and I hope that similar events will be held in the future.

Review: Creating Corpora With Active Learning

PhD thesis reviewed by Marina Santini. Fredrik Olsson, Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora. Doctoral thesis, University of Gothenburg, 2008. Download thesis from this page: The PhD thesis "Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora" by Fredrik Olsson contains 13 chapters and an appendix with the base learner parameter settings. The Introduction unfolds the problem and the argument, and the remaining 12 chapters describe the background (Part I, Chapters 2-5), present the BootMark method (Part II, Chapter 6), test the proposed method (Part III, Chapters 7-12), and summarize findings, experience, and viable future directions (Part IV, Chapter 13). The thesis describes BootMark, a bootstrapping method for named-entity recognition based on active learning. The

Reblogging: Gavagai! Gavagai!

Source: Follow the Data Blog — Follow the Data podcast, episode 1: Gavagai! Gavagai! by Mikael Huss Podcast link: Follow The Data | Episode 1 – Gavagai! Gavagai! This first episode, as has been mentioned before on this blog, is about a Stockholm startup company, Gavagai, which provides a technology platform called Ethersource. We interviewed the company's CDO (chief data officer), Fredrik Olsson, and the chief scientist, Magnus Sahlgren, and we think it resulted in a very interesting chat, although the sound quality is perhaps not ideal due to our inexperience with podcasting. Some interesting tidbits from the conversation: The name "Gavagai" comes from a thought experiment by Quine demonstrating the "indeterminacy of translation". It's also the reason for the presence of the little rabbit on the Gavagai web page. Olsson describes Ethersource as a "semantic processing layer of

Review: The Word-Space Model Revisited

PhD thesis reviewed by Marina Santini. The Word-Space Model by Magnus Sahlgren, Doctoral Thesis in Computational Linguistics at Stockholm University, Sweden, 2006. Available online at: Contents and Research Questions. The PhD thesis The Word-Space Model by Magnus Sahlgren contains 16 chapters, namely an Introduction and 15 chapters distributed into four parts. Part I (Chapters 2-4) presents the theoretical background, Part II (Chapters 5-7) contains the theoretical foreground and is Sahlgren's main original contribution, Part III (Chapters 8-15) describes the experiments, and Part IV (Chapter 16) summarizes the research and draws conclusions. Most chapters start with a quotation, and most of these are from The Simpsons. The main research question around which the thesis is structured is: what kind of semantic information does the word-space model acquire and represent? The answer is

Reblogging: Big Data Week

A good week for (big) data (science) Source: Follow the data – A Data Driven Blog, Posted by Mikael Huss, 10 March 2012 Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the "Two Most Important Algorithms in Predictive Modeling Today" session), I devoted this week to more big-data/data-science things than usual. Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I'll just single out two people for having developed high-throughput DNA sequencing related Hadoop software: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the

Reading Suggestion: Adjectives and adverbs as indicators of affective language for automatic genre detection (2008)

Rittman, Robert and Nina Wacholder. (2008). Adjectives and adverbs as indicators of affective language for automatic genre detection. Proceedings of AISB 2008 Convention, Symposium on Affective Language. Aberdeen, Scotland, April 1-2, 2008. Abstract. We report the results of a systematic study of the feasibility of automatically classifying documents by genre using adjectives and adverbs as indicators of affective language. In addition to the class of adjectives and adverbs, we focus on two specific subsets of adjectives and adverbs: (1) trait adjectives, used by psychologists to assess human personality traits, and (2) speaker-oriented adverbs, studied by linguists as markers of narrator attitude. We report the results of our machine learning experiments using Accuracy Gain, a measure more rigorous than the standard measure of Accuracy. We find that it is possible to classify

