The Path Forward: From Big Unstructured Data to Contextualized Information

How can we convert massive quantities of unstructured data to structured information? What kind of “structure” do we need for a reliable interpretation of this undomesticated data? I suggest thinking of a text-analytic framework based on “context”.

Search keywords, events, entities, sentiments, attitudes, polarities, opinions, etc. carry a different weight and require a different assessment depending on the kind of text, the situational context, the field of discussion, the authority of the source, and the purpose of use. For example, for official use, factual texts might have more credibility than opinionated texts. In this respect, press conferences, declarations or announcements by a White House spokesman might be more reliable than newspapers’ speculations or op-ed articles. Conversely, if we want to take the pulse of public opinion and explore feelings about a product or a politician, we might give more weight to blogs, forums or microposts on social networks.
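
The purpose-dependent weighting described above can be sketched in a few lines of Python. This is only an illustrative sketch: the genre labels, the two purposes, the fallback value and every numeric weight are hypothetical values chosen for the example, not measured ones.

```python
# Hypothetical weights: how much credit a genre gets for a given purpose.
# All labels and numbers are illustrative assumptions, not empirical values.
GENRE_WEIGHTS = {
    "official_use": {
        "press_conference": 0.9,
        "op_ed": 0.3,
        "blog": 0.2,
        "micropost": 0.1,
    },
    "opinion_monitoring": {
        "press_conference": 0.2,
        "op_ed": 0.6,
        "blog": 0.8,
        "micropost": 0.9,
    },
}

DEFAULT_WEIGHT = 0.5  # assumed fallback for genres missing from the table


def weight(genre: str, purpose: str) -> float:
    """Return the assumed weight of a genre for a given purpose."""
    return GENRE_WEIGHTS[purpose].get(genre, DEFAULT_WEIGHT)


def rank(documents, purpose):
    """Sort (text, genre) pairs from most to least useful for the purpose."""
    return sorted(documents, key=lambda doc: weight(doc[1], purpose), reverse=True)
```

With such a table, the same collection of documents is ranked differently for official use and for opinion monitoring, which is the point of making the purpose explicit.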

I claim that genre, sublanguage and domain are invaluable but unexploited textual dimensions that can be used to extract the communicative context (and descriptive metadata) needed to assess the value of information with respect to a purpose (business, learning, findability, monitoring, predicting, etc.).

In my view:

  • Genre gives us the compositional context. When we know the genre of a document, we know how the content is organized and where to find the most important information. For instance, on the web, when the genre of a digital text is unknown or not declared explicitly, users often feel at a loss and do not know how to assess how reliable, objective or useful the information is. The same is true in business intelligence, customer care optimization, and many other practical applications.
  • Sublanguage provides a situational context influenced by the medium of communication (e.g. telephone, face-to-face, chats, video-conferencing, microblogging, etc.). Sublanguage has nothing to do with terminology (specialized words, aka terms, used in a specialized domain); sublanguage is not register (e.g. cues of formality, casual conversation, etc.); sublanguage is not style. Think of the sublanguage characterizing tweets and the sublanguage used in customer care help centers or chats. They can be enormously different, though they might all be informal, conversational and polite. Sublanguage is formulaic, cross-topical and mostly domain-independent. For instance, the sublanguage used in a car rental help center is similar to the sublanguage used in a first-aid call center. In both cases, there will be a salutation (e.g. Good morning), investigation (e.g. How can I help you? When did this happen? Where are you now?), personal detail requests (e.g. What is your name?), and similar.
  • Domain refers to a field of interest or to a subject matter. It can be medicine, politics, marketing, literary criticism, etc. A domain can have a specific terminology.
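
The formulaic, mostly domain-independent character of sublanguage described in the list above can be illustrated with a small rule-based tagger. This is a minimal sketch under strong assumptions: the move names follow the examples in the text, but the cue patterns and the function name tag_moves are hypothetical, and a real system would need far richer cues than a handful of regular expressions.

```python
import re

# Hypothetical cue patterns for the three formulaic moves named in the text.
# The patterns are illustrative assumptions, not a validated inventory.
MOVE_PATTERNS = {
    "salutation": re.compile(
        r"\b(good (morning|afternoon|evening)|hello|hi)\b", re.IGNORECASE
    ),
    "investigation": re.compile(
        r"\b(how can i help|when did|where are you)\b", re.IGNORECASE
    ),
    "personal_details": re.compile(
        r"\bwhat is your (name|address|phone number)\b", re.IGNORECASE
    ),
}


def tag_moves(utterance: str) -> list[str]:
    """Return the names of all moves whose cue pattern matches the utterance."""
    return [
        move
        for move, pattern in MOVE_PATTERNS.items()
        if pattern.search(utterance)
    ]
```

Because none of the patterns mentions cars or injuries, the same tagger fires on an operator's turns in a car rental help center and in a first-aid call center, while domain-specific content is left to a separate terminology layer.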

Together, these three textual dimensions help contextualize big unstructured data. Essentially, this means that linguistic pre-processing, textual analysis and the understanding of how human communication works on different media and across different types of text have a strong bearing on the quality, reliability and usability of “big data”.

Which text-analytic tools can presently contextualize information? My guess is that we should start working toward the next generation of context-aware analytic tools.

Marina Santini, Copyright © 2012, All rights reserved.

10 comments for “The Path Forward: From Big Unstructured Data to Contextualized Information”

  1. Fred
    30 March, 2012 at 06:59

    Hi Marina, I like computational linguistics and have been working on it for several years. I find your theory on “genre, sublanguage and domain” interesting, and I believe additional proof, such as a case study, would make it more convincing. I’d be very glad if you could talk more about it. Thank you!

    • 30 March, 2012 at 18:15

      Hi Fred,

      I am working on case studies. Do you have anything you can suggest from your own experience?

      Cheers, Marina

  2. 30 March, 2012 at 13:41

    Very good text! 🙂

  3. 28 May, 2012 at 18:01

    Hi Marina

    I’m a Biologist and my interest is in the importance of context in AI.
    My company MyEdo(NI) has recently applied for a patent for a device that extracts info and adds context to unstructured data. It is a tool that has big importance for search and big data organization on every level.
    Contextualizing unstructured data opens up a lot of doors for cutting edge analytics and for integrating the data into human systems.

  4. 29 May, 2012 at 09:16

    Hi Robert,

    I am very glad that you agree on the practical importance of framing unstructured data with a context. If you have any public documentation you can share, I will be very happy to read it.

    Best wishes,


    • 31 May, 2012 at 00:13

      Hi Marina

      A very interesting publication is the recent white paper by Fujitsu titled

      ‘Linked Data – connecting and exploiting big data’

      it is very insightful regarding the importance of context in data and considers the importance of linked data in achieving the contextualization of unstructured data.
      Personally, for quite some time I have been of the opinion that linked data is key to contextualization. I tend to think of data organization in terms of biological analogies, but let me know your thoughts anyway, if you get a chance to read it.



  5. 31 May, 2012 at 06:04

    Hi Robert,

    thanks for the reading suggestion. I will read it with pleasure.

    The linking idea is also developed by Lennart Björneborn in this chapter for a web environment:

    Thanks again

  6. 24 November, 2012 at 01:59

    I absolutely agree with Robert’s idea. Linked Data eventually identifies, collocates and further provides interoperability among huge volumes of big data. In other words, it tries to bring order out of chaos. Remember that this method brings “meaning” through identifying and collocating. This method uses “actionable” metadata rather than flat metadata. The Web is getting smarter. ^^

  7. Robert Craig
    2 January, 2013 at 03:57

    Hello Dr Cho

    The Web indeed is getting smarter!
    I think the way forward with ‘unstructured data’ lies not only in linked data but a strong consideration of how that linking is achieved/controlled on a large scale. I’m from Belfast and over the next few weeks will be forging links with the guys at DERI Galway some of whom I’m sure you are familiar with. It would be nice to speak again sometime in the future.
