On Thursday, September 6, 2012 the first meetup on BIG DATA & PREDICTIVE MODELING – WHAT’S HAPPENING IN STHLM? was held at the Klarna headquarters in Stockholm. The event was very successful and (according to the organizer) unexpectedly well attended, with about 90 passionate practitioners and, more generally, people interested in big data (like myself). Although I could not join the socializing before the event or, above all, at the bar afterwards, it was a very informative and enjoyable meeting, and I hope that similar events will be held in the future.
The meetup was organized by Mikael Hussain, Head of Applied Analytics at Klarna, a Swedish e-commerce company that provides payment solutions for online stores. Klarna’s winning idea is to let consumers pay after delivery of their goods, thereby creating a trust-based process that lets customers verify their identities using basic personal data. Apparently, about 20% of all e-commerce sales in Sweden go through Klarna [Wikipedia]. Klarna’s success story is pleasantly narrated here.
In his short introduction Mikael stressed that the aim of the meeting was to learn more about the different approaches applied to big data to make reliable predictions. Predictive modelling is fundamental in e-commerce: for instance, Klarna must make an automated decision about the trustworthiness of a customer within a few seconds of each transaction. Mikael’s introduction was followed by seven ten-minute presentations.
The first speaker, Fredrik Olsson, presented “Understanding and Big Data: Ethersource as the semantic processing layer in the Big Data Stack“. Fredrik, a computational linguist holding the position of Chief Data Officer at Gavagai, the spin-off that recently placed 2nd on the list of Sweden’s most successful digital entrepreneurs in 2012, briefly described Ethersource, a technology that “tracks relations between terms and symbols in streaming language data”. Ethersource can handle 1000+ in 10 languages and looks at language in terms of “attitudes”, i.e. the positive or negative feelings creeping into social media. He showed the kind of data Ethersource can handle and the difficulties connected with the rocketing increase of new words, acronyms, onomatopoeic expressions and graphic symbols pervading human communication on social networks, especially on Twitter. Remarkably, Fredrik was the only presenter who showed the “actual data” that Gavagai’s technology is fed with, i.e. the natural language “in a state of flux” of tweets, blogs and other social media.
In his presentation “Learning from prediction contests“, Mikael Huss, who together with Joel Westerberg runs a blog that I like called “Follow the Data”, emphasized the importance of taking part in data-based competitions, such as those organized by Kaggle, one of several prediction contest platforms. The valuable benefits of contest participation are: 1) gaining knowledge of and familiarity with diverse data; 2) competing with and learning from peers; and 3) winning prizes, for those ranked among the top three.
Josef Lindman Hörnlund, who works in the Applied Analytics group at Klarna, presented “Quick and not so dirty data science with Random Forests”. In his talk, Josef pointed out that traditional statistics, such as Pearson’s linear correlation coefficient, is virtually inapplicable when handling millions of data points. He suggested a more out-of-the-box way of thinking and gave an example of how it is possible to compute an approximate correlation across 8 million data points by training machine learning classifiers (namely, Random Forests) in R. Josef stressed that his approach, although not exact, is extremely fast: if I remember correctly, it takes only 10 seconds to make a decision with acceptable approximation.
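To give a flavour of the idea, here is my own toy sketch in Python (not Josef’s actual R code; the variables and data are made up): train a small Random Forest to predict one variable from the others on a subsample, and read the feature importances as a fast proxy for which variables are actually associated.

```python
# Hypothetical sketch (not Josef's code): Random Forest feature importances
# as a fast proxy for correlation on large data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 50_000  # stand-in for millions of rows; in practice one would subsample
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)  # strongly related to x1
x3 = rng.normal(size=n)                   # pure noise, unrelated to x1

# Train a small, shallow forest to predict x1 from x2 and x3; the
# importances then rank which predictors actually carry signal.
rf = RandomForestRegressor(n_estimators=20, max_depth=8,
                           n_jobs=-1, random_state=0)
rf.fit(np.column_stack([x2, x3]), x1)
print(rf.feature_importances_)  # importance of x2 dwarfs that of x3
```

This is far cheaper than computing exact statistics over the full data set, at the cost of the accuracy Josef mentioned.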
Amin Jalili, a PhD student at the Department of Computer and Systems Sciences (DSV), Stockholm University, presented “Process mining“, a new research area devoted to process discovery.
Erik Zeitler, currently working at Klarna, talked about “Massively parallel stream processing” and presented the results of his research as a PhD student at Uppsala University with Tore Risch (full paper). Erik first stressed how important it is to find a good way to split an information stream, and then showed how his customizable stream splitting yields exceptionally good results.
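For readers new to the topic, the basic idea behind stream splitting can be illustrated with a toy Python sketch (my own, vastly simpler than Erik’s system): route each event to a worker by hashing a key, so that all events sharing that key reach the same worker.

```python
# Toy illustration (not Erik's system): key-based stream splitting.
from collections import defaultdict

def split_stream(events, num_workers, key):
    """Assign each event to a worker by hashing its key, so that
    events with the same key always land on the same worker."""
    partitions = defaultdict(list)
    for event in events:
        partitions[hash(key(event)) % num_workers].append(event)
    return partitions

events = [{"user": "alice", "amount": 10},
          {"user": "bob", "amount": 5},
          {"user": "alice", "amount": 7},
          {"user": "carol", "amount": 3}]
parts = split_stream(events, 2, key=lambda e: e["user"])
# both of alice's events end up in the same partition
```

The hard part, as I understood Erik’s talk, is doing this kind of splitting at very high throughput and making the split strategy customizable, which a naive single-threaded hash loop like this obviously does not address.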
Carl-Rickard Häggman, from iZettle, talked about “Big data from a startup perspective”. iZettle offers services within the payment industry. Interestingly, Carl-Rickard said that from his perspective “revolution is not about the transaction fees, but about the data you collect”. Even more, he added that “it is the diversity of data and the creativity to make sense of it” that gives the added value to the business. He described how, not yet having a large amount of data, iZettle uses an exploration program and relies on “unconventional risk decisions”, such as cutting out the tails of a distribution curve when making decisions.
Last but not least, Johan Petterson, who founded and works at Big Data, presented the analytics of players in his “Hadoop @ king.com” and showed how 2 billion units per day are processed from a log server using Hadoop, Hive, MapReduce and Oozie.
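For readers unfamiliar with the paradigm, here is a toy map/reduce sketch in Python (my own, with a made-up log format; not King’s actual pipeline) of the kind of event counting such a Hadoop job performs over log lines:

```python
# Toy map/reduce sketch (assumed log format "timestamp<TAB>event_type<TAB>payload");
# a real Hadoop job would run the mapper and reducer distributed over many nodes.
from collections import Counter

def mapper(lines):
    # Emit (event_type, 1) for every log line.
    for line in lines:
        yield line.split("\t")[1], 1

def reducer(pairs):
    # Sum the counts per event type.
    counts = Counter()
    for event_type, n in pairs:
        counts[event_type] += n
    return counts

log = ["t1\tgame_start\t...", "t2\tlevel_up\t...", "t3\tgame_start\t..."]
counts = reducer(mapper(log))
print(counts)  # Counter({'game_start': 2, 'level_up': 1})
```

At 2 billion events per day the same logic is spread across a cluster, which is exactly what Hadoop’s MapReduce layer automates.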
It was an extremely interesting and informative meeting throughout, and we all hope that it will soon be followed by other Big Data gatherings. If I can express a wish, next time I would like to see more examples of data and its structure. “Big Data” is quite a vague label in interdisciplinary settings. I am familiar with Fredrik Olsson’s data because I am also a computational linguist. But, for instance, what kind of data is stored in Johan Petterson’s log server, and what kind of “data diversity” is wished for by Carl-Rickard?
If I may dare, I would suggest creating a web space where it is possible to share data with others. I know there are always objections and problems with confidentiality, industrial secrets, etc. However, data can easily be sanitized and purged of confidential information. Usually data collections can be used without any problems for research purposes, and I guess all of you are carrying out some kind of research when working on a new data model.
Sharing data would guarantee not only cross-fertilization, interdisciplinarity and quicker advances in the Big Data world, but also cross-validation of existing models, tests of the models’ robustness, interoperability, etc.
What do you think?
Thanks Mikael and Klarna for organizing this event!