Meetup Report: Big Data & Predictive Modeling – What’s happening in Sthlm?

On Thursday, September 6, 2012, the first meetup on BIG DATA & PREDICTIVE MODELING – WHAT’S HAPPENING IN STHLM? was held at the Klarna headquarters in Stockholm. The event was very successful and (according to the organizer) unexpectedly well attended: about 90 people, passionate practitioners and, more generally, people interested in big data (like myself). Although I could not stay for the socializing before and, above all, after the event at the bar, it was a very informative and enjoyable meeting, and I hope that similar events will be held in the future.

The meetup was organized by Mikael Hussain, who is Head of Applied Analytics at Klarna, a Swedish e-commerce company that provides payment solutions for online stores. Klarna’s winning idea is to let consumers pay after delivery of their goods, thereby creating a trustful process that lets customers verify their identities using basic personal data. Apparently, about 20% of all e-commerce sales in Sweden go through Klarna [Wikipedia]. Klarna’s success story is pleasantly narrated here.

In his short introduction, Mikael stressed that the aim of the meeting was to learn more about the different approaches applied to big data to make reliable predictions. Predictive modelling is fundamental in e-commerce: for instance, Klarna must make an automated decision about the trustworthiness of a customer for each transaction in a few seconds. Mikael’s introduction was followed by seven 10-minute presentations.

The first speaker, Fredrik Olsson, presented “Understanding and Big Data: Ethersource as the semantic processing layer in the Big Data Stack“. Fredrik — who is a computational linguist holding the position of Chief Data Officer at Gavagai, the spin-off that recently took 2nd place in the 2012 list of Sweden’s most successful digital entrepreneurs — briefly described Ethersource, a technology that “tracks relations between terms and symbols in streaming language data”. Ethersource can handle 1000+ in 10 languages and looks at language in terms of “attitudes”, i.e. the positive or negative feelings creeping into social media. He showed the kind of data Ethersource can handle and the difficulties connected with the rocketing increase of new words, acronyms, onomatopoeic expressions and graphic symbols pervading human communication on social networks, especially on Twitter. Remarkably, Fredrik was the only presenter who showed the “actual data” that Gavagai’s technology is fed with, i.e. the natural language “in a state of flux” of tweets, blogs and other social media.

In his presentation “Learning from prediction contests“, Mikael Huss — who together with Joel Westerberg is managing a blog that I like called “Follow the Data” — emphasized the importance of taking part in data-based competitions, such as those organized by Kaggle, one of the several prediction contest platforms. The valuable benefits deriving from contest participation are: 1) gaining knowledge of, and familiarity with, diversified data; 2) competing with and learning from peers; 3) winning prizes, if ranked among the top three.

Josef Lindman Hörnlund — who works in the Applied Analytics group at Klarna — presented “Quick and not so dirty data science with Random Forests”. In his talk, Josef pointed out that traditional statistics, such as Pearson’s linear correlation coefficient, is virtually inapplicable when handling millions of data points. He suggested a more out-of-the-box way of thinking and gave an example of how an approximate correlation can be computed among 8 million data points by training machine learning classifiers (namely, Random Forests) using R. Josef stressed that his approach, although not accurate, is extremely fast and (if I remember correctly) takes only about 10 seconds to make a decision with acceptable approximation.
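To illustrate the general idea (this is my own minimal sketch, not Josef’s actual code, and it uses Python/scikit-learn rather than R): a small, shallow Random Forest can be trained on a large sample, and its feature importances read as a fast, rough measure of which variables are associated with the target, instead of computing exact correlations over millions of points.

```python
# Sketch: Random Forest feature importance as a fast proxy for
# association on large data (assumed illustration, not Josef's code).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 100_000  # stand-in for "millions"; subsample further in practice

x_signal = rng.normal(size=n)   # truly associated with the target
x_noise = rng.normal(size=n)    # unrelated to the target
y = 2.0 * x_signal + 0.1 * rng.normal(size=n)

X = np.column_stack([x_signal, x_noise])

# A small, shallow forest keeps training fast at the cost of accuracy.
forest = RandomForestRegressor(
    n_estimators=10, max_depth=5, n_jobs=-1, random_state=0
)
forest.fit(X, y)

importances = forest.feature_importances_
print(importances)  # the signal feature should dominate
```

The trade-off is exactly the one Josef described: the importances are only an approximation, but the forest trains in seconds even on millions of rows.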

Amin Jalili, who is a PhD student at the Department of Computer and Systems Sciences (DSV), Stockholm University, presented “Process mining“, a new area devoted to process discovery.

Erik Zeitler, currently working at Klarna, talked about “Massively parallel stream processing” and presented the results of his research from his time as a PhD student at Uppsala University with Tore Risch (full paper). Erik first stressed how important it is to find a good way to split an information stream, and then showed how his customizable stream splitting returns exceptionally good results.
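To give a flavour of the problem (my own toy illustration, not Erik’s actual splitter): the simplest splitter hashes a key so that all events for the same key reach the same parallel worker, which means per-key state never has to be shared between workers.

```python
# Toy stream splitter (assumed illustration): route events to one of
# NUM_WORKERS partitions by hashing the event key.
from collections import defaultdict

NUM_WORKERS = 4

def route(event_key: str) -> int:
    """Pick a worker for an event by hashing its key."""
    return hash(event_key) % NUM_WORKERS

partitions = defaultdict(list)
for key, payload in [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]:
    partitions[route(key)].append((key, payload))

# All of alice's events land in the same partition.
alice_workers = {route(k) for k in ("alice", "alice")}
print(len(alice_workers))  # 1
```

Erik’s point, as I understood it, was that such a naive splitter is not enough at very high rates, and that a customizable splitting strategy performs much better.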

Carl-Rickard Häggman, from iZettle, talked about “Big data from a startup perspective”. iZettle offers services within the payment industry. Interestingly, Carl-Rickard said that from his perspective “revolution is not about the transaction fees, but about the data you collect”. Moreover, he added that “it is the diversity of data and the creativity to make sense of it” that gives added value to the business. He described how — not yet having a large amount of data — iZettle uses an exploration program and relies on “unconventional risk decisions”, such as cutting out the tails of a distribution curve when making decisions.

Last but not least, Johan Petterson, who founded and works at Big Data, presented the analytics of players in his “Hadoop @” talk and showed how 2 billion units per day are processed from a log server using Hadoop, Hive, MapReduce and Oozie.
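For readers unfamiliar with the pattern, here is a toy sketch (my own, not from Johan’s talk; the log format is invented) of the map/reduce logic behind such log aggregation: map each log line to a (key, 1) pair, then reduce by summing the counts per key. On Hadoop the same logic runs distributed over many machines.

```python
# Toy map/reduce over log lines (hypothetical log format).
from collections import Counter

log_lines = [
    "2012-09-06 12:00:01 player=42 action=login",
    "2012-09-06 12:00:02 player=7 action=move",
    "2012-09-06 12:00:03 player=42 action=move",
]

def map_phase(line):
    """Map: emit (action, 1) for each log line."""
    action = line.rsplit("action=", 1)[1]
    yield (action, 1)

# Reduce: sum the counts per key.
counts = Counter()
for line in log_lines:
    for key, value in map_phase(line):
        counts[key] += value

print(dict(counts))  # {'login': 1, 'move': 2}
```

At 2 billion units per day, the point of Hadoop/Hive/Oozie is of course not the logic itself but running and scheduling it reliably across a cluster.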

It was an extremely interesting and informative meeting throughout, and we all hope that it will soon be followed by other Big Data gatherings. If I may express a wish, next time I would like to see more examples of data and its structure. Big Data is quite a vague label in interdisciplinary conversation. I am familiar with Fredrik Olsson’s data because I am also a computational linguist. But, for instance, what kind of data is stored in Johan Petterson’s log server, or what kind of “data diversity” is wished for by Carl-Rickard?

If I may dare, I would suggest creating a web space where it is possible to share data with others. I know there are always objections and problems with confidentiality, industrial secrets, etc. However, data can easily be sanitized and purged of confidential information. Usually data collections can be used without any problems for research purposes, and I guess each of you is carrying out some kind of research when working on a new data model.
Sharing data would guarantee not only cross-fertilization, interdisciplinarity and quicker advances in the Big Data world, but also cross-validation of existing models, tests of the models’ robustness, interoperability, etc.

What do you think?

Thanks Mikael and Klarna for organizing this event!

Marina Santini

6 comments for “Meetup Report: Big Data & Predictive Modeling – What’s happening in Sthlm?”

  1. Mikael
    7 September, 2012 at 12:32

    Good writeup! I agree with your points. It was a pity that I didn’t get the chance to talk to you although I noticed you were sitting across the table – due to the packed schedule, the socializing had to wait for the pub afterwards.

  2. 7 September, 2012 at 14:57

    Great writeup! I’m sorry too that I didn’t have the time to chat; next time, hopefully! It would be fun to look at data structures; perhaps next time I can do a presentation on something like that.

  3. Jeannine
    7 September, 2012 at 16:15

    Terrific write-up! Thanks for sharing background on the many interesting companies and projects.

  4. 26 September, 2012 at 09:06

    Discussion from the Data Mining, Statistics, and Data Visualization LinkedIn Group

    Oleg Okun • Thank you, Marina, for the excellent and detailed report! It is good that Big Data and Predictive Analytics have come to Sweden, but both customers and service providers reside only in Stockholm/Gothenburg (the first and second most populated cities in Sweden). The pace of adoption of analytical methods by Swedish companies that possess a lot of data is too slow. Too many miss too much. Apart from Klarna, game-developing Swedish companies currently dominate among early adopters of data mining and big data mining. Nothing is heard from retailers, telcos, banks and insurance companies, though at least the last two categories of enterprises must rely on analytics in their work. Retailers and telcos could get enormous profits from using advanced analytical methods, and yet it seems they are waiting for something.

    Marina Santini • I completely agree with you, Oleg!

    Dario Galimberti • Hello Ms Santini,

    I read your meeting report with interest.

    I was interested in “Quick and not so dirty data science with Random Forests”. You wrote:”In his talk, Josef pointed out that traditional statistics, such as Pearson’s linear correlation coefficient, is virtually unapplicable when handling millions of data points. He suggested a more out-of-the-box way of thinking and made an example of how it is possible to compute an approximate correlation among 8 million data points by training machine learning classifiers (namely, Random Forests) using R. Josef stressed that his approach, although not accurate, is extremely fast (if I remember correctly) takes only 10 seconds to make a decision with acceptable approximation.”

    I had the same interest but with a different goal: forecasting precisely.
    So with random forests we may have high speed at the cost of rough approximation.
    Not so useful for precise forecasting.
    Thank you for your attention.

    Oleg Okun • Hello Dario,

    The following short paper may help you:

    There, random forest was used for feature selection only. After that, one can use the selected features in a different algorithm better tuned to precise forecasting.

    Dario Galimberti • Hello Oleg,

    Thank you for your link.
    In fact, I would have liked to analyze the two algorithms to see the differences.
    The question may be: using stochastic discrimination, is it possible to forecast precisely?
    I am open to hearing opinions.
    Thank you.

    Oleg Okun • Hello Dario,

    I didn’t try stochastic discrimination, though I have heard that this ensemble technique can provide high accuracy. If you seek very precise algorithms, I think you may approach your answer from the bias–variance dilemma point of view. By “very precise”, you likely mean both low bias and low variance. In practice, for many algorithms it is difficult to lower both characteristics at the same time, though it seems that variance is easier to reduce than bias.

    Another point in your quest is that the precise forecast may be less important than the precise trend (positive, negative). For instance, if you know with high confidence that shares of a certain company are going to rise in value next week, then you can safely emit a ‘buy’ signal. Of course, profits will differ depending on how many shares one bought, but they will, in any case, be profits, not losses.

    Riku Lappi • About outliers and precise forecasting…
    Part of the variable space is just impossible. How to eliminate it?
    1) Check Gary Horne and data farming. He has published common-sense approaches with forceful algorithms to eliminate impossible parts of the variable space in order to concentrate on reality.
    2) False identification of improbable variable combinations as high-flyers is both common and dangerous. Perhaps the best-known, the most costly in the last 5 years (the subprime-launched crisis), and still the most widely used way to get it totally wrong is to
    a) assume that the probability distribution of a phenomenon follows some beautiful, simple pattern like the normal distribution;
    b) assume without proof that variables are independent of each other, when they may not be. In retrospect it is easy to conclude that the ability of a Detroit family (father in a car factory, mother a partner in a restaurant near a major car factory) to pay mortgages depends on many interconnected variables, like Middle East security and its effect on the oil price, etc.

    Hence, an extensive Monte Carlo approach to the simulation of a multivariable problem is a waste of time. On the other hand, you cannot describe humans using just the most elegant mathematics. How to approach the problem?

    Take ladies in a city downtown, for example:
    – high heels to make legs look longer
    – push-up bras or implants to make breasts look bigger
    – make-up to camouflage the natural face
    – optical tricks in clothing to make the dimensions of the body look different from real

    When enough energy is used to fool the observer, observations get biased. That is my point concerning Precise Forecasting of human behavior like business.

    Riku Lappi • In short: please help me.
    Any ideas on how to grasp the essence of a multidimensional human-related problem? First eliminate the impossible. Next step?
    Can you suggest reading or other sources for ideas?
    I have experience in the pharmaceutical industry (hard science + big business image/brand building) and in fire & nuclear plant probability-based risk analysis (hard science + big emotions).

    I need tools for comparing optional solutions in a multidimensional variable space. Where to start? Help!

    Dario Galimberti • Hello Oleg,

    Thank you for your ideas. If I may, I would like to add some points.
    I wished to underline that “stochastic discrimination” may be dual-use… helping your forecast or creating problems. It may depend on the algorithms used and the nature of the situations you have to analyze.
    I may agree about the bias–variance dilemma, even if the two may be linked.
    In finance, where I use forecasting models, it is better to use dynamic models. It is possible to use dynamic models with different lags… so trends are quite easy to determine, and maximum and minimum points (supports and resistances) are quite easy to determine. Anyway, scalping techniques are different from “non-scalping” techniques.
    In finance, the big problem is to insert the “psychological behavior” variable into models.
    In other terms, in another situation, there was another challenge with friends about forecasting… the question was: is it possible to forecast lotteries (or roulette) using statistics and mathematics?
    Most statisticians and mathematicians said “no”. Some may say “yes”.
    In general terms: can you forecast an aleatory event?
    This was the problem… in general terms.
    Interested in hearing other opinions.
    Thank you.

    Dario Galimberti • Hello Mr Riku Lappi,

    About the essence of a multidimensional human-related problem: my experience…
    You may have this situation: F(a, b, c, d, e, …) = 0.
    You do not know anything about the variables. You know nothing about the functional relations among the variables.
    You have only numerical outputs… I mean numerical data.
    You must estimate the function F.
    It is necessary to determine the relevant variables, and to determine the functional relations among them: which are independent and which dependent. If dependent… further work to search for other possible variables.
    Your question may be: which statistics to use?
    We may have a lot of possibilities and approaches.
    Thank you… and standing by…
