Reblogging: Gavagai! Gavagai!

Source: Follow the Data Blog. Follow the Data podcast, episode 1: Gavagai! Gavagai! By Mikael Huss. Podcast link: Follow The Data | Episode 1 – Gavagai! Gavagai! This first episode, as has been mentioned before on this blog, is about a Stockholm startup company, Gavagai, which provides a technology platform called Ethersource. We interviewed the company’s CDO (chief data officer), Fredrik Olsson, and the chief scientist, Magnus Sahlgren. We think it resulted in a very interesting chat, although the sound quality is perhaps not ideal due to our inexperience with podcasting. Some interesting tidbits from the conversation: The name “Gavagai” comes from a thought experiment by Quine demonstrating the “indeterminacy of translation”. It’s also the reason for the presence of the little rabbit on the Gavagai web page. Olsson describes Ethersource as a “semantic processing layer of … Read entire article »

Filed under: reblogging

Reblogging: A freely available, open source taxonomy and autoclassification tool

Clade – a freely available, open source taxonomy and autoclassification tool. By Charlie Hull at Flax. One way to manage digital information is to classify it into a series of categories or a hierarchical taxonomy. Traditionally this was done manually by analysts, who would examine each new document and decide where it should fit. Building and maintaining taxonomies can also be labour intensive, as these will change over time (for a simple example, just consider how political parties change and divide, with factions appearing and disappearing). Search engine technology can be used to automate this classification process, and the taxonomy information can be used as metadata, so that search results can be easily filtered by category or automatically delivered to those interested in a particular area of the hierarchy. We’ve been working on an … Read entire article »

Filed under: dissemination, reblogging

Reblogging: Practical advice for machine learning

Practical advice for machine learning: bias, variance and what to do next. By Mikael Huss at Follow the data. The online machine learning course given by Andrew Ng in 2011 (available here among many other places, including YouTube) is highly recommended in its entirety, but I just wanted to highlight a specific part of it, namely the “Practical advice” part, which touches on things that are not always included in machine learning and data mining courses, like “Deciding what to do next” (the title of this lecture) or “debugging a learning algorithm” (the title of the first slide in that talk). His advice here focuses on the concepts of bias and variance in statistical learning. I had been vaguely aware of the concepts of “bias and variance tradeoff” and “bias/variance decomposition” for a long time, but I had always … Read entire article »

Filed under: dissemination, reading suggestions, reblogging

Reblogging: A little tutorial on mapreduce

By Joel Westerberg at Follow the data. This is a short tutorial to explain the concept of map/reduce. It can be run on a Unix system, such as Linux or OS X. We’ll first process the data sequentially and then with parallel mapper tasks. As a simple example, we will compile a list of prime numbers from some text files containing numbers (some prime, some not) and then calculate the sum of all the primes found. Finding primes can be parallelized and is thus on the map side of the algorithm, but calculating the sum cannot, and it is therefore our reduce function. Let’s start by creating some test data that is small and easy to debug, so it’ll run fast. We’ll do this in a terminal shell … Read entire article »
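The excerpt stops before the tutorial's own shell commands, so here is a minimal Python sketch of the same idea, not the tutorial's actual code. It assumes each input file has already been read into a list of integers. The names `is_prime`, `mapper`, `reducer`, `sum_primes` and the sample `chunks` are all hypothetical. The mappers run on a thread pool purely to show the structure; for genuinely CPU-bound work in CPython one would likely swap in a process pool.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def is_prime(n: int) -> bool:
    """Trial division; adequate for small test data."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def mapper(numbers: list[int]) -> list[int]:
    """Map step: keep only the primes found in one chunk of input."""
    return [n for n in numbers if is_prime(n)]


def reducer(total: int, primes: list[int]) -> int:
    """Reduce step: fold one mapper's output into the running sum."""
    return total + sum(primes)


def sum_primes(chunks: list[list[int]], workers: int = 4) -> int:
    # Each chunk stands in for one input file (an assumption of this sketch).
    # The mapper calls are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, chunks))
    # The reduce step combines the mapper outputs sequentially.
    return reduce(reducer, mapped, 0)


# Hypothetical test data: three small "files" of numbers.
chunks = [[1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13]]
print(sum_primes(chunks))
```

Keeping the mapper free of shared state is what makes it safe to run in parallel; only the reducer sees results from more than one chunk.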

Filed under: dissemination, reblogging

Reblogging: Informer, Spring Issue

Informer: Newsletter of the BCS Information Retrieval Specialist Group
Table of Contents
Editorial: By Udo Kruschwitz on April 28, 2012
Conference Review: ECIR 2012 Industry Day: By Franco Maria Nardini on April 26, 2012
Book Review: Search Analytics for Your Site: By Tyler Tate on April 26, 2012
Conference Review: ECIR 2012: By Claudia Hauff on April 25, 2012
Call for Book Reviews: By Cathal Gurrin on April 25, 2012
Conference Review: ECIR 2011: By Cathal Gurrin on April 18, 2012
The Information Needs of Mobile Searchers: By Tyler Tate on April 6, 2012
Designing Faceted Search: Getting the basics right (pt 2): By Tony Russell-Rose on April 4, 2012
Events spring 2012: By Andy Macfarlane on March 30, 2012
… Read entire article »

Filed under: dissemination, reading suggestions, reblogging

Reblogging: Big Data Week

A good week for (big) data (science). Source: Follow the data – A Data Driven Blog. Posted by Mikael Huss, 10 March 2012. Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual. Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed Hadoop software for high-throughput DNA sequencing: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the … Read entire article »

Filed under: reblogging