Mining Query Logs: Query Disambiguation & Understanding through a KB

Marina Santini. Copyright © 2012

Work in progress

Talking about query logs, Karlgren (2010) points out: “There are several reasons to be cautious in drawing too far-reaching conclusions: we cannot say for sure what the users were after; […]”.

However, some linguistic problems can be sorted out, for example those related to sublanguage, terminology, multi-word expressions, etc. Interestingly, the use of different sublanguages has been studied by Karin Friberg Heppin in her PhD thesis, Resolving Power of Search Keys in MedEval: A Swedish Medical Test Collection with User Groups: Doctors and Patients. Karin highlights how patients (laymen) and doctors (experts) use different vocabulary (or terminology) to denote the same concept. For example, patients might use the word “painkiller” while doctors may prefer the word “analgesic” to refer to the same treatment. Different sublanguages might also call for different genres in order to meet users’ needs: presumably patients would prefer documents with simplified language, such as patient leaflets, while doctors would prefer the specialized language of medical journals and chemical descriptions.
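One practical use of such sublanguage knowledge is expanding a query across registers so that it matches both layman and expert documents. A minimal sketch in Python (the term pairs and function name are invented for illustration; a real mapping would come from a curated resource):

```python
# Hypothetical layman <-> expert term pairs (illustration only).
SUBLANGUAGE_MAP = {
    "painkiller": "analgesic",
    "heart attack": "myocardial infarction",
}

def expand_query(query: str) -> set[str]:
    """Return the query plus variants with sublanguage terms swapped,
    so that both layman and expert documents can be matched."""
    variants = {query}
    for layman, expert in SUBLANGUAGE_MAP.items():
        if layman in query:
            variants.add(query.replace(layman, expert))
        if expert in query:
            variants.add(query.replace(expert, layman))
    return variants
```

For instance, expand_query("strongest painkiller") yields both the original query and "strongest analgesic", and the swap works in the other direction too.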

In my view, this kind of problem can be addressed through a dictionary-shaped knowledge base where the different uses of language are stored and continually updated. I will call this knowledge base “DaisyKB”.

DaisyKB has a flexible XML structure made of entries and fields.

Here is a classic example of an ambiguous word:

<ENTRY>bank
    <POS>NN</POS>
    <SENSE_1>
        <NAME>BANK_NN_1</NAME>
        <INFLECTIONS>bank, banks</INFLECTIONS>
        <DESCRIPTION>Place for money, financial institution, …</DESCRIPTION>
        <VARIANTS>…</VARIANTS>
        <SYNSET>DEPOSITORY_NN, STOREHOUSE_NN, FUND_NN, RESERVE_NN, SAVINGS_NN</SYNSET>
        <COLLOCATIONS>BANK_ACCOUNT_NP, BANK_NOTE_NP, CLEARING_BANK_NP, BANK_HOLIDAYS, …</COLLOCATIONS>
        <FREQUENT_QUERIES>bank of england, bank of america, bank holidays, etc.</FREQUENT_QUERIES>
        <ABSTRACT_REPRESENTATION>BANK_NN_1</ABSTRACT_REPRESENTATION>
        <TERMINOLOGY>
            <FINANCE>UNIT_BANK_NP</FINANCE>
        </TERMINOLOGY>
    </SENSE_1>
    <SENSE_2>
        <NAME>BANK_NN_2</NAME>
        <INFLECTIONS>bank, banks</INFLECTIONS>
        <DESCRIPTION>Land along the sides of a river or lake</DESCRIPTION>
        <VARIANTS>…</VARIANTS>
        <SYNSET>SHORE_NN, MARGIN_NN, BORDER_NN, EDGE_NN, SIDE_NN</SYNSET>
        <COLLOCATIONS>…</COLLOCATIONS>
        <ABSTRACT_REPRESENTATION>BANK_NN_2</ABSTRACT_REPRESENTATION>
    </SENSE_2>
</ENTRY>

DaisyKB can have fields with: collocations, terminology, companies, domain, frequent co-occurring words, frequent queries containing the keyword, sublanguage, genre, etc. It can be thought of as an object-oriented framework where each entry is an object and each field is a method that can be called when needed. This means that an entry can be accessed at different levels of granularity.
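A minimal sketch of this object-oriented view, in Python with the standard library's ElementTree (the class, the simplified entry and the attribute names are my own illustration, not an existing DaisyKB implementation):

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical entry; a real DaisyKB entry would
# carry the full set of fields shown in the example above.
ENTRY_XML = """
<ENTRY lemma="bank">
    <SENSE name="BANK_NN_1">
        <DESCRIPTION>Place for money, financial institution</DESCRIPTION>
        <SYNSET>DEPOSITORY_NN,STOREHOUSE_NN,FUND_NN</SYNSET>
    </SENSE>
    <SENSE name="BANK_NN_2">
        <DESCRIPTION>Land along the sides of a river or lake</DESCRIPTION>
        <SYNSET>SHORE_NN,MARGIN_NN,BORDER_NN</SYNSET>
    </SENSE>
</ENTRY>
"""

class DaisyEntry:
    """Wraps one <ENTRY>; each field is exposed as a method,
    so the entry can be queried at different levels of granularity."""

    def __init__(self, xml_string: str):
        self.root = ET.fromstring(xml_string)

    def senses(self) -> list[str]:
        """Coarse granularity: just the sense names."""
        return [s.get("name") for s in self.root.findall("SENSE")]

    def synset(self, sense_name: str) -> list[str]:
        """Fine granularity: the synset of one specific sense."""
        for sense in self.root.findall("SENSE"):
            if sense.get("name") == sense_name:
                return sense.findtext("SYNSET").split(",")
        return []

entry = DaisyEntry(ENTRY_XML)
print(entry.senses())             # ['BANK_NN_1', 'BANK_NN_2']
print(entry.synset("BANK_NN_2"))  # ['SHORE_NN', 'MARGIN_NN', 'BORDER_NN']
```

A query-disambiguation component could then call only the method it needs (e.g. just the synsets) without loading the whole entry structure into its own logic.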

DaisyKB can be pre-populated by migrating existing resources and updated using an XML-friendly programming language like XSLT.
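Purely as a sketch of such programmatic updates, here is the same idea in Python's standard-library ElementTree rather than XSLT (the field and attribute names are illustrative):

```python
import xml.etree.ElementTree as ET

# A stripped-down entry; field and attribute names are illustrative.
entry = ET.fromstring('<ENTRY lemma="bank"><SENSE name="BANK_NN_1"/></ENTRY>')

# Insert a new FREQUENT_QUERIES field into the first sense.
sense = entry.find("SENSE")
fq = ET.SubElement(sense, "FREQUENT_QUERIES")
fq.text = "bank of england, bank holidays"

# Deleting a field again is just as easy:
# sense.remove(fq)

print(ET.tostring(entry, encoding="unicode"))
```

The same insertion or deletion could be expressed as an XSLT template; the point is only that an XML-shaped resource makes systematic, scriptable updates cheap.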

Major benefits would be: standardization of scattered resources; flexibility (it can be updated at any time, systematically); consistency and coherence; reduced management overhead; multilinguality; fewer idiosyncrasies and errors; increased efficiency; reusability for other products or activities. It can be open-source and collaborative, and it can be built with XML and programmed with XSLT for quick updating and for the deletion or insertion of new fields.

DaisyKB can be used for indexing, word-sense disambiguation, query disambiguation and analysis, multilingual queries, and lots more… I believe that web search engines, enterprise search engines, domain-specific information systems and other language-based applications could benefit from such a resource.

Thoughts?


5 comments for “Mining Query Logs: Query Disambiguation & Understanding through a KB”

  1. 22 May, 2012 at 13:37

    From LinkedIn: The WebGenre R&D Group [http://lnkd.in/aZUNnv]

    Jan Jasik • Can you accomplish the desired level without a social ontology? Every culture has been adding nuanced versions of taxonomies… overwhelming. We stand united … on the sidelines…;)

    R. David Weaver • Thanks Marina. Remember the days of D-Base with Pro-Cite and other software which could satisfy most users. Today we enshrine most users and any context, even if misunderstood or just incorrect. Umberto Eco would be lost in the syntax.

    Best,

    R. David.

    Joe Stafura • This ambiguity problem is vexing in all areas of language research. In the areas of TTS (text-to-speech) and ASR (automatic speech recognition), defining and constraining the lexicon to the domain works best, e.g. medical terminology for EMRs vs. warehouse management systems for logistics.

    In these businesses you can’t achieve an effective solution with undefined context, and that is a tricky thing to do outside of a defined application corpus.

    Marina Santini • I understand your objections.
    However, I do not object to the creation of a more comprehensive and flexible resource, which — although not perfect and although far from the complexity of human language — can provide more extensive help and can be updated regularly.

    Think about the difference between a bilingual dictionary written at the end of the XIX century and a recent online bilingual dictionary, rich in corpus-based examples, phraseology and contextual connotations such as colloquial, highbrow, deprecated, etc. The latter certainly does not represent exhaustively two different cultures, or two different societies, or the complexity of different semantic uses… but which one would you prefer to use? Which one can provide more hints about the actual behaviour of a word? Since we have better technology nowadays than one century ago, we can “easily” devise a multi-faceted resource, where the facets are the different contexts of use. It might be that my view is closer to distributional semantics than to an ontological view of the world… I am not sure yet 🙂

    Joe Stafura • My view is also based on semantics; developing semantic spaces for applications is one of our services. The hints we look for are in the meaning of the words and phrases within the specific context.

    We do experience multiple cultures every day and mentally shift across these quite easily in the course of living; in the course of learning it becomes more difficult.

    R. David Weaver • Marina, the concern is the amount of time and energy devoted to all submeanings and the elevation of such. The word “awesome” was at one time an enriching expression; at this point people use it as an adverbial clause, hesitation device, noun, and virtually every other type of meaning. We can develop structures to analyze and present such, but at what cost and strain to the user? Not accepting any authority systems [yes, the OED is perhaps elitist] puts us in the position of analyzing Coen Brothers movies, where the “Dude” refers not only to Jeff Bridges but to virtually anyone in the plot.

    Best,

    R. David

    Marina Santini • Hi David and Joe,
    your concerns are justified. But I would go practical … this means that I would start from real-world industry/business/academic projects. Once the standards and the specifications have been agreed upon, at least to a large extent, the starting point is to populate the resource with entries useful to individual projects, reusing existing material whenever possible.

    Once the first phase/release is consolidated, the resource should become open source, so everybody can contribute to it, like in a wiki. The crucial point in my view is to define an exchangeable standard that is easy to use and define specifications that are easy to follow.

    • Joe Stafura
      22 May, 2012 at 16:52

      Don’t misread my responses as meaning that the idea isn’t a good one; my comments were to explain that there are already some similar efforts that could provide some guidance as to what has worked and what problems exist.

      Antoinette’s comment is also along the lines of my view on how this can happen: the NELL project here at CMU is a system that could be advanced to create a self-categorizing corpus with recent shifts in meanings and uses as they emerge on the Internet in some form.

      We use LSA and PLSA as a tool to these ends.

      • 22 May, 2012 at 18:30

        Hi Joe,

        could you pls send me some documentation and links to prototypes (if any)? It would be great to see your approach in detail.

  2. 22 May, 2012 at 13:52

    From LinkedIn: American Society for Information Science & Technology group [http://lnkd.in/eY75dn]

    Kelli Bragg • I love how flexible the mining structure can be. Since I recently applied for a temporary position creating/updating taxonomies for a recipe-based website, I’m currently contemplating what a taxonomy tree might include for such a site.

    Because I’m just starting out in this field, I’m still learning all the various elements that need to be addressed (combined with ensuring enough appropriate linking for SEO). This article helped me better imagine what a good KB for recipe mining should contain, although ensuring inclusion of all the various cultural nuances associated with food may prove a bit more elusive.

    Antoinette Arsic • Thoughts: this kind of knowledge base will never be fully built. Instead, I would build a knowledge base not by defining all the different senses of a word, but by extracting them from the context of the surrounding language.

    Marina Santini • @Kelli: culture is always hard to encode because it can be … everything. If one manages to narrow down the focus, I do not think capturing cultural nuances is extremely difficult. What I have in mind is the approach used by the Longman Dictionary of Language and Culture: when needed, the lexicographers used a grey box called “Cultural Note” where they explain cultural differences. Would this be suitable for recipes?

    @Antoinette: I would not be so pessimistic 🙂 What you suggest is already done by concordancing tools, but this type of co-text is often insufficient and very much dependent on the underlying corpus…

  3. 22 May, 2012 at 13:56

    From LinkedIn: Critical Discourse Analysis

    Kaisa Azriouli • Interesting in that coordination is the part of an object having the possibility to reach different fields as ‘methods’.
