How to construct a taxonomy of user’s interests automatically?

What is the best way of constructing a taxonomy to classify **user’s interests** from
unstructured social data that internet users have input?

Have you tried commercial products? If so, what is your experience with them?
Have you tried an ad-hoc algorithm? If so, which approach would you recommend?
Do you know of any existing taxonomy of users’ interests?

Thanks in advance for your answers.

Cheers, Marina

19 comments for “How to construct a taxonomy of user’s interests automatically?”

  1. 18 April, 2012 at 06:52

    Comments from Linguistic Semantics LinkedIn Group:

    Jean François Delannoy
    • I’m trying to do that for my own interests.
    Are you thinking of tracking themes x genres ?
    Possibly other quality parameters?

    Marianna Bolognesi, PhD • I am working at the moment with distributional semantic spaces (or vector spaces, or word space models, e.g. Landauer & Dumais 1997 or Sahlgren 2006) and I am trying to build semantic spaces out of the metadata that users associate with pictures on Flickr.
    These models allow you to map meanings according to a similarity parameter. Let me know if you need bibliography tips or more details on this; I can help if you need.

    Jean François Delannoy • What type of semantics of pictures? Theme, style, precise type of objects?… View of trees, only tall trees, only pines, for foresters, for environmentalists, for painters ?

    Building from word spaces may produce very hybrid thingies.

    Marianna Bolognesi, PhD • Distributional semantics.
    Which means: the more two words tend to appear in the same contexts, the more similar the two words are. (And you have to decide what you mean by context, because it can be many different things!)
    At the moment I am analysing tags, which indeed include visual, conceptual and emotional features for each captured episode (each image). For example, given the words for colors (primary and secondary colors: red, orange, yellow, green, blue, purple) and analysing the tags that occur together with these color tags across millions of photos on Flickr, it turns out that the distribution of the colors resembles a rainbow. In other words, the distribution of tags appearing with red is similar to the distribution of tags appearing with orange. The distribution of tags appearing with orange is very similar to the distribution of tags appearing with red and yellow…and so on, like in a rainbow (or a Brainbow! :-)). This doesn’t seem to happen in word space models based on traditional corpora of written texts, but it does in these word space models based on perceptual (visual) features of episodes (images).
    It’s kind of difficult to discuss this in a forum…I hope what I wrote makes some sense to you! Let me know!
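
    A minimal sketch of the distributional idea described above, assuming each photo is reduced to a set of tags. The toy data and the helper names `context_vector` and `cosine` are hypothetical and are not taken from the Flickr study; the point is only that two tags count as similar when the tags around them are similar.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: each set holds the tags of one photo.
photos = [
    {"red", "sunset", "sky"},
    {"orange", "sunset", "sky"},
    {"red", "rose", "garden"},
    {"orange", "sunset", "beach"},
    {"blue", "sea", "beach"},
    {"blue", "sea", "sky"},
]


def context_vector(target, tagged_items):
    """Count the tags that co-occur with `target` across all items."""
    counts = Counter()
    for tags in tagged_items:
        if target in tags:
            counts.update(tags - {target})
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


red = context_vector("red", photos)
orange = context_vector("orange", photos)
blue = context_vector("blue", photos)

# In this toy data, red shares more context with orange than with blue.
print(cosine(red, orange), cosine(red, blue))
```

    Here "context" means "tags on the same photo"; as noted above, other definitions of context are possible and would give different spaces.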

    David Eddy • @Mariana –

    > constructing a taxonomy to classify **user’s interests**

    Assuming that “taxonomy” means an ordered hierarchy of terms, my guess is any supposed taxonomy you hope to extract will be highly skewed in a variety of directions.

    Taxonomic order? Probably not so much.

    I look at my two collections of formal keywords (tags) accumulated over the past 20 years. Collection#1 is some 800 tags spread over 10 categories (buckets). Accumulated at the whim/need of the moment.

    Collection#2 begun in 2006 (pretty much based on collection#1, but structured differently based on the experience/familiarity from collection#1) is not easily countable, so I don’t know how many tags have migrated from #1 to #2.

    Useful to me? Absolutely. Ordered? Good enough for me, but I wouldn’t want to defend it in public. Hierarchical?

    Marina Santini • @Marianna: this is exactly what I need to know. Yes, please, if you have more references and results of your own experiments.
    @David: are your collections publicly available? Any chance to look at them?
    @JF: the idea is to infer main interests from actions such as: like a page on FB….

    Jean François Delannoy • The input data are “Like”s and navigation,
    but you’ll need more modeling of the object of the liking or the target of the navigation.
    What does the user like in a page or comment? In my work, I pinpoint it more precisely.

    Marina Santini • @JF: can you please send me your references or experiment description?

    Marianna Bolognesi, PhD • Dear Marina, I’ll present my data in July at the CLC at King’s College, and after that I hope to put together a contribution for the proceedings. So far I don’t have anything written properly, but if you want I can send you some graphs privately with some explanation.

    Marina Santini • OK, Marianna, I will wait for the final version then. Please keep me posted!
    Good luck with your work

    Larry Smith • I believe to some extent your goal might be a species of the genus “unsupervised topic discovery”, insofar as there has been work to discover the topics in text, although ‘taxonomy’ can imply hierarchy and I don’t recall that as a feature. You might look specifically for papers by R. Schwarz.

    Marina Santini • Thanks Larry!

  2. 18 April, 2012 at 07:03

    Comments from KD2U – Knowledge Discovery in Distributed and Ubiquitous… LinkedIn Group

    Ina Lauth • Dear Marina,
    under the supervision of Prof. Ernestina Menasalvas from Univ. Politecnica Madrid we have done some research on semi-automatically building a taxonomy based on users’ query behavior, which reflects their interests on, for example, a TV channel portal. You can use this as a frame for implementing several text/web mining tools that would catch the user interests in several iterations, from the different perspectives that you need for your application. Here are two publications. If they are interesting for your research, ping me on my LinkedIn account and let me know where to send you the electronic version:

    * Maria Valencia, Codrina Lauth, Ernestina Menasalvas. Emerging User Intentions: Matching User Queries with Topic Evolution in News Text Streams, Oct 23, 2008, IPMU (=Information Processing and Management of Uncertainty in Knowledge-Based Systems) Journal. (journal paper)

    * Codrina Lauth, Ernestina Menasalvas. Emerging User Intentions: Matching User Queries with Topic Evolution in News Text Streams, IPMU 2008 in Malaga. (conference paper)

    We should post this discussion under UTMA (=Ubiquitous Text Mining and Analytics) too, there may be more text/semantics-oriented people in there too, that may help you.

    Ina Lauth
    • Contact Prof. Menasalvas directly; she is the expert in this area and will help you with more references and use cases.

  3. 18 April, 2012 at 07:10

    Comments from UTMA – Ubiquitous Text Mining and Analytics LinkedIn Group
    Ina Lauth • Thank you for opening this discussion. My answer to this is in the KD2U LinkedIn Group.
    Here again are the two publications for building the taxonomy framework of user’s interests:

    1.) M. Valencia, C. Lauth, E. Menasalvas. Emerging User Intentions: Matching User Queries with Topic Evolution in News Text Streams, Oct 23, 2008, IPMU (=Information Processing and Management of Uncertainty in Knowledge-Based Systems) Journal. (journal paper)

    2.) C. Lauth, E. Menasalvas. Emerging User Intentions: Matching User Queries with Topic Evolution in News Text Streams, IPMU 2008 in Malaga. (conference paper)

  4. 18 April, 2012 at 07:18

    Comments from TTC: Terminology Extraction, Translation Tools and Comparable… LinkedIn Group

    Francois Brown de Colstoun
    • Hello Marina,
    We are doing that at Lingua et Machina, currently still in a step-by-step manner, hopefully soon in a fully automated mode.
    We can run a test for you, just let me know.

    Marina Santini • Thanks Francois! That would be great. Must check whether the data can be shared, first….

  5. 18 April, 2012 at 07:22

    Comments from Semantic Web Analytics LinkedIn Group

    Sujit Pal • I haven’t actually done one of these myself, but one way would be tracking clicks on pages on your site. Assuming each of these pages corresponds to one or more categories, the user’s taxonomy could be his login id (if logged in) or his IP address (if not) mapped to a set of categories for pages he has visited on your site.
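
    The click-tracking idea above can be sketched roughly as follows, assuming a click log of (user key, URL) pairs and a page-to-category mapping. The URLs, the categories and the function name `build_profiles` are hypothetical and only illustrate the shape of the data.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from page URL to the categories of that page.
PAGE_CATEGORIES = {
    "/reviews/ski-boots": ["winter sports", "shopping"],
    "/guides/stockholm-restaurants": ["restaurants", "travel"],
    "/news/ski-resorts": ["winter sports", "travel"],
}


def build_profiles(click_log, page_categories):
    """Map each user key (login id or IP address) to category counts."""
    profiles = defaultdict(Counter)
    for user_key, url in click_log:
        # Pages without known categories contribute nothing.
        for category in page_categories.get(url, []):
            profiles[user_key][category] += 1
    return profiles


clicks = [
    ("user42", "/reviews/ski-boots"),      # logged-in user
    ("user42", "/news/ski-resorts"),
    ("203.0.113.7", "/guides/stockholm-restaurants"),  # anonymous, keyed by IP
    ("user42", "/unknown-page"),
]
profiles = build_profiles(clicks, PAGE_CATEGORIES)
print(profiles["user42"].most_common())
```

    Keeping counts rather than a plain set of categories is a design choice of this sketch: it lets the most frequently visited categories stand out.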

    Marina Santini • Aha! Interesting… Thanks Sujit.

    Alexander Osherenko • I wonder what unstructured social data users can enter. Is it mouse clicks, blogs, etc.? Moreover, I wonder what kind of taxonomy you are looking for: is it a hierarchy of interests?

    Why I ask: we used the sociological theory of Pierre Bourdieu in our socionics project to model interests as scalar values. I assume that such values can be calculated using your social data.

    Prateek Jain • Just curious how different it will be from some of the work done on Attention Profile Markup Language. Even if it is different, it might be a good place to look at for modeling related scenarios and for reusing some of the taxonomy.

    Marina Santini • @Prateek: thanks. I will have a look at it.
    @Alexander: yes, it is likes, blogs, social network actions…

  6. 19 April, 2012 at 08:02

    Comments from Content Management LinkedIn Group

    Jacqui Harris • Go to a company called Firestring from South Africa – they have a semantic engine with automated taxonomy based on natural language.

    Marina Santini • Thanks for your suggestion, Jacqui.
    Cheers, Marina

  7. 20 April, 2012 at 16:13

    Comments from Information Science and LIS LinkedIn Group

    Maureen Boland • How about creating a word cloud for a taxonomy of user’s interests? I know this may be backwards and too simplistic, but it’s visual and quick. I’ve copied and pasted open survey questions into a word cloud generator. The more often a word appears, the greater its prominence.

    Sania Battalova • Thank you for starting the discussion; I would be very interested to learn about your experience. Has anyone tried a mind mapping tool?

    Marina Santini • @Maureen: yes, why not? I know wordle, but I hadn’t thought of using it this way. You’re right… Possibly removing stop words, I would say. It can give a visual first impression…
    @Sania: I do not know it, but I will give it a try. Thanks for the suggestion. I will summarize my experience later on…
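
    The word-cloud suggestion boils down to counting words after removing stop words. A minimal sketch, assuming free-text survey answers as input; the stop-word list is deliberately tiny and, like the function name `word_weights`, purely illustrative.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "i", "to", "of", "in", "is", "my", "on"}


def word_weights(texts, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


answers = [
    "I like skiing and the mountains",
    "Skiing in the Alps is my favourite",
    "Restaurants and skiing",
]
weights = word_weights(answers)

# The count is what a word cloud would render as font size.
print(weights.most_common(3))
```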

  8. 24 April, 2012 at 09:19

    Comments from The WebGenre R&D Group on LinkedIn

    James Harbour • I think it would depend on what the data looks like. With respect to something like the “Like/Unlike” flagging that has a bit of traction in the more popular social media sites, there are at least two choices. The first choice might be to associate the “interest/disinterest” with the heading of the item, while the second choice would be to (at first) associate it with the entirety of the item. Of course, when there are external links attached to the items themselves, things get a bit more interesting. Do you mine the link’s title in addition to the original item, or the full content of the link, or nothing at all? And do you run some sort of associative algorithm or method on the body of the item to try and mine for patterns and associate them with the interest/disinterest? And what do you do with items in an individual’s feed that are largely ignored? Do you leave them completely out of your analysis, or do they contain some sort of input that is worthy of consideration? (I think they are external to the exploration effort – some people go for days without looking at their social media feeds – this means that data is simply not seen at all and is therefore statistically misleading if included in the analysis). Sorry for the length of the reply. I am somewhat rushed for time and this is all stream-of-consciousness…

    Richard Creamer • This isn’t a strong area for me, but here are some thoughts which may be relevant:

    Multiple Personas

    • Consider creating a different interests taxonomy for each persona an individual human can assume.
    • A good exercise is to log into and look at their recommender engine’s displayed items. You should clearly see examples of different personas you’ve had in effect during prior browsing sessions such as:

    – Shopping for a professional product (e.g., technical book)
    – Shopping for a personal item (e.g., new camera lens)
    – Shopping for a personal item intended as a gift for a friend

    • You may want to create a taxonomy of personas
    • An exercise I did last year was to manually catalog the topics/interests in all of my postings (outgoing stream, see below) on a social networking site. (“Hands-on” analysis is sometimes helpful.)

    Stream Direction

    • Consider the “stream direction” when analyzing interests:

    – Topics on which a person makes postings (outgoing stream interests)
    – Topics on which a person adds comments (indeterminate stream direction)
    – Topics of postings a person views for at least several seconds (incoming stream interests)


    • Sometimes a posting on a specific topic is unexpectedly interesting to a person.
    • These sorts of unexpected interest areas are often unpredictable and cover a wide range of topics

    Marina Santini
    • @Richard and @James
    Your questions, suggestions, experience are very inspiring and useful for my pre-study. Thanks for sharing your views with me and with the group members.

    I have collected all the comments from LinkedIn here: (see the comments to the post).
    Cheers, Marina

  9. 26 April, 2012 at 09:24

    Comments from Text Analytics Group on LinkedIn

    Scott Tucker • When I hear Taxonomy, I think OWL Ontology. I use the Protege Open Source tool from Stanford for OWL.

    Marina Santini • Yes, right, OWL… I had not thought of it. Thanks, Scott.


  10. 3 May, 2012 at 08:59

    Comments from The Language Technology Group on LinkedIn

    Sabine Buchholz • Hi, that sounds like an interesting problem. Could you elaborate a bit? Can you give an example of “unstructured social data”? How many entries do you have?
    Are you only interested in a fully automatic approach or would something semi-automatic or maybe fully crowdsourced be of interest as well?

    Daniel Gray, MBA • Echoing Sabine’s suggestion…my clients are using Amazon’s Mechanical Turk (MTurk), a marketplace for crowdsourcing/microtasking, to structure data by creating taxonomies, augment data, add metadata, categorize data, etc…MTurk supplements software/computer processing/algorithms with ‘human judgment’ to achieve this type of objective because there’s the 20% that machines just can’t accomplish.

    Marina Santini • Hi Sabine and Daniel,
    the general idea is to infer main interests from actions such as: “Liked” on a page on FB, and similar.
    At this stage, I am collecting material for a pre-study, so all suggestions are welcome. The pre-study originated from the following request:
    “Ideal output might be “young educated urban woman who likes restaurants and winter sports”. Perhaps that’s distilled from location data, likes for restaurants, likes for ski resorts or ski equipment companies. I imagine that age, location, gender, and possibly education level are fairly easy for us to determine, i.e. the data might not be that unstructured to pose too much of a problem determining its meaning. But a like for, a ski resort here in Northern California, is a bit more tricky. If I lived in Stockholm, I might not know that is a ski resort. ”

    @Daniel, how would you use MTurk to extract this kind of interests? Have you some previous experience with this type of data?

    @Sabine: Both fully-automatic approaches and semi-automatic methods are ok. The final choice will depend on a combination of time-accuracy-quality.


    Sabine Buchholz • OK, I’m just brainstorming here, but if the main source of data is “likes” and these always refer (correct me if that’s wrong) to websites, then I would start by retrieving the website (first page or more) and do unsupervised clustering of those documents. Then you could manually (e.g. using crowdsourcing) assign “interest labels” to clusters. You will likely have to play around with the exact clustering algorithm/parameters, so to evaluate any run, you could measure how much agreement there is between people assigning interest labels to clusters: more agreement probably means that the clustering worked more like you wanted. This could bootstrap you a taxonomy.
    Of course this assumes you don’t have a taxonomy to start with. Have you explored whether you could derive a suitable one from Wikipedia? Seems to me that Wikipedia very much reflects people’s interests 🙂
    To pursue your example: Wikipedia has an entry which also appears in, and the article mentions “ski” and “snow” several times.

    @Daniel: Would like to hear more about your approach.
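
    The evaluation step suggested above (more agreement between people labelling a cluster suggests a better clustering) could be sketched as follows. This assumes that each cluster has already been given one interest label per annotator; simple pairwise agreement is used here as a stand-in for a proper inter-annotator statistic, and the labels and run names are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Fraction of annotator pairs that chose the same label for one cluster."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_agreement(cluster_labels):
    """Average pairwise agreement over all clusters of one clustering run."""
    scores = [pairwise_agreement(labels) for labels in cluster_labels.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels from three annotators for two clustering runs.
run_a = {
    0: ["winter sports", "winter sports", "winter sports"],
    1: ["restaurants", "restaurants", "food"],
}
run_b = {
    0: ["winter sports", "travel", "shopping"],
    1: ["restaurants", "food", "nightlife"],
}

# The run whose clusters are easier to name consistently scores higher.
print(mean_agreement(run_a), mean_agreement(run_b))
```

    Exact string matching is a simplification: near-synonymous labels such as "restaurants" and "food" count as disagreement here, so labels would probably need normalising first.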

    Marina Santini • Thanks, Sabine!

  11. 3 May, 2012 at 09:03

    Comments from Semantic Web on LinkedIn

    Kelly Hatfield • “The DBpedia knowledge base currently describes more than 3.64 million things, out of which 1.83 million are classified in a consistent Ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organisations, 183,000 species and 5,400 diseases”

    DBpedia’s class hierarchy:

    Depending on the scope of the interests you are trying to capture, this might be a place to start – books, movies, games, and organizations (including sports teams) seem like pretty common interest categories.

    Magnus Knuth • That’s a very interesting topic. For modeling topic hierarchies I would consider SKOS as the vocabulary to choose. There are some topic hierarchies available in SKOS, e.g. the Library of Congress Subject Headings (LCSH), the German Standard-Thesaurus Wirtschaft (STW), the GEMET thesaurus or IPTC. For more general domains YAGO might be fitting, which is based on Wikipedia categories and WordNet. So far, I could not find evaluations of the suitability of such taxonomies for user interest classification. Since I am interested in research in this field, drop me a line if you are doing a project related to this.

    Marina Santini • Hi Kelly and Magnus,
    thanks a lot for your suggestions. It would be handy to start from some pre-built hierarchies.
    If you had the following request:
    “Ideal output might be “young educated urban woman who likes restaurants and winter sports”. Perhaps that’s distilled from location data, likes for restaurants, likes for ski resorts or ski equipment companies. I imagine that age, location, gender, and possibly education level are fairly easy for us to determine, i.e. the data might not be that unstructured to pose too much of a problem determining its meaning. But a like for, a ski resort here in Northern California, is a bit more tricky. If I lived in Stockholm, I might not know that is a ski resort. ”

    How would you apply existing taxonomies to social data to extract this kind of interests? Inference? Graphs? Semi-supervised classification? more?

    Cheers, Marina

    Alfredo Serafini • This is a really interesting topic.
    I think that the semi-supervised approach is interesting if and when you want to expose a sort of meta-taxonomy, which in some manner supersedes or implies groups of users’ interests.

    Marina Santini • Thanks, Alfredo.

  12. 4 May, 2012 at 12:40

    Comments from Applied Linguistics on LinkedIn

    Alla Sobirova • Can we make a comparison/analogy with Maslow’s pyramid?

    Marina Santini • Yes, why not? it would be a good idea to give it a try. Thanks, Alla! Cheers, Marina

  13. 4 May, 2012 at 12:48

    Comments from Data Scientists on LinkedIn

    Prateek Jain • Is it possible for you to share what the objective is? What do you plan on doing with this taxonomy? You probably have enough suggestions by now, but how about hooking up the interests which are identified to an existing taxonomy? For example, if a user is interested in “Soccer Clubs”, then finding where it resides in, let’s say, . It will get hooked into a whole bunch of terms like this

    Though this is very broad, it is also good, as it captures the interest from different contexts.
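
    A rough sketch of what hooking an identified interest into an existing taxonomy could look like, assuming the taxonomy is available as "broader term" links (similar in spirit to `skos:broader`). The fragment below is invented for illustration and does not reproduce any real taxonomy; `broader_terms` is a hypothetical helper.

```python
# Hypothetical fragment of a taxonomy: each term maps to its broader terms.
BROADER = {
    "Soccer Clubs": ["Soccer", "Sports Organizations"],
    "Soccer": ["Team Sports"],
    "Team Sports": ["Sports"],
    "Sports Organizations": ["Organizations", "Sports"],
    "Sports": [],
    "Organizations": [],
}


def broader_terms(term, broader=BROADER):
    """Return every term reachable from `term` by following broader links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# One identified interest pulls in all of its broader contexts.
print(sorted(broader_terms("Soccer Clubs")))
```

    Because a term may have several broader terms, one interest ends up attached to several contexts at once (here both sport and organizations), which is the breadth mentioned above.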

    Marina Santini • Hi Prateek,
    I would like to keep the goal undefined for the time being. I would like to have a broad view of the possible solutions first, and then work on a number of use cases.
    Your suggestion is interesting. Thanks a lot, Prateek.
    Cheers, Marina

  14. 4 May, 2012 at 12:55

    Comments from Enterprise Architecture: Tactical . Strategic . Visionary on LinkedIn

    Jan Jasik • (copy) Enterprise Architecture is about methods that allow capture of ambiguity… in case of users interest, first step would be to capture understanding through taxonomy…(?), so one of the end-games could be capture of intentions through ontology…? Would that allow to express a semantic universe, a part of universe’s consciousness… by aggregating ontologies?

    Marina Santini • Hi Jan, I like the questions you asked, but what are your answers to them?
    Cheers, Marina

    Jan Jasik • I have been waiting for an evolution in ontology (not my field) to something multidimensional as torus (visually). It should reflect time space continuum. As SOA, EA in general, one benefits from an ecosystem construct. This way you could leverage governance, discovery… user’s privacy. Practical however, somewhat limiting role, could be given to a user’s profile (a pointer reference) by collecting user’s choices and generating a unique template (taxonomy) powered by a predictor’s pattern (ontology). Questions: how unique user would remain? Would you view a user’s discovery (unique identity) across federated systems or entire ecosystem (enterprise)? And the final issue of user’s interests management, privacy, … opting out from that discovery…?

  15. christophe clugston
    8 February, 2013 at 07:53

    There are many problems with trying to connect interest from FB to actual conversion (buying). FB is a very low percentage conversion market. In fact, despite the social media hype pushers (like in the dot-com banner ad days), it has a HORRIBLE ROI.

    The safest way to understand what interests are (and I am speaking from an internet business paradigm) is to see what lists the person has OPTED into. You can also try keyword and wheel search. (Business has the clearest actual goal/performance connection, btw.)

  16. 8 February, 2013 at 20:59

    What is a “wheel search”, Chris?

  17. christophe Clugston
    15 February, 2013 at 11:28

    This system never notifies me of responses, thus the late reply. Google Wheel search shows the pattern of how people get to a certain page. Internet marketers use it.

    • 18 February, 2013 at 22:06

      Hi Chris, did you subscribe to new comments’ notifications? Thanks about Google Wheel.

  18. 22 March, 2013 at 23:31

    Yes I have marked it on all posts I have replied to
