White Paper: Text Analytics and Genre Awareness to the rescue of BI & CEM

Text Analytics and Genre Awareness to the rescue of Business Intelligence (BI) & Customer Experience Management (CEM)

by Marina Santini, Copyright © 2012, All rights reserved.

Citation:  Santini, Marina (2012). Text Analytics and Genre Awareness to the rescue of Business Intelligence (BI) & Customer Experience Management (CEM). [White paper]. Retrieved from http://www.forum.santini.se/2012/02/white-paper-text-analytics-and-genre-awareness-to-the-rescue-of-bi-cem/

Business Intelligence and Customer Experience Management

Generally speaking, business intelligence (BI) refers to computer-based techniques used to analyze business data, such as sales revenues and relational database reports, mainly through standard statistical packages. BI’s main aim is to support better business decision-making and planning. Many statistical packages are available to make sense of numbers and structured data, e.g. SAS, SPSS and Cognos.

However, businesses also create a huge amount of valuable information in the form of emails, memos, notes from call centers, news, user groups, chats, reports, tweets, Facebook pages, blogs, forums, marketing material and so on. More than 80% of all business information exists in these forms (source) or, as I would say, in these “genres”. These genres, i.e. these types of text, contain strategic but often unexploited or underexploited information. The information contained in these web genres or digital genres is called unstructured data. The exploitation of unstructured data is recognized as a major unsolved problem in the information technology industry, one that engenders huge economic losses and poor decision-making.

Similar problems arise with data coming from Customer Relations and Customer Experience Management (CEM), especially now that many companies have decided to enhance user engagement and loyalty, or to track feedback, through Twitter and Facebook microblogging. A case in point is the winning strategy adopted by American Airlines for its loyalty program, AAdvantage, which relies on both Twitter accounts and Facebook pages to communicate with customers.

Text-based techniques (variously called data mining, text mining, text analytics, social media analytics, social media mining, topic detection, content discovery, content analytics and so forth) are being used to cope with this avalanche of unstructured data. Let’s use the term “text analytics” here as an umbrella term covering all these techniques.

Many commercial text analytics solutions (a short list is here) help clients and customers make educated decisions based on unstructured data analysis by extracting textual insights and organizing results into categories that are meaningful to specific business contexts.

The common tasks of a text-analytics effort typically include:

  • Creation of a corpus, i.e. collecting or identifying a set of textual materials, on the web or held in a file system, database, or content management system, for analysis.

Extraction of one, some or all of the following information:

  • Named Entity Recognition
  • Features, such as telephone numbers, e-mail addresses, quantities (with units), etc.
  • Coreference, i.e. the identification of noun phrases and other terms that refer to the same object.
  • Relationship, fact, and event extraction: identification of associations among entities and other information in text
  • Sentiment analysis, i.e. the identification of subjective or factual text and the extraction of various forms of attitudinal information: affect, sentiment, opinion, mood, and emotion.
  • More…
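As a rough illustration of the “features” item above, the following Python sketch pulls e-mail addresses, telephone numbers and quantities with units out of a piece of text using regular expressions. The patterns, the unit list and the function name `extract_features` are assumptions made for this example only; real-world formats vary widely, and production systems typically need far more robust rules or trained extractors.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not a standard).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least six digits/separators, then a closing digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
# A number followed by one of a few hard-coded metric units.
QUANTITY_RE = re.compile(
    r"\b(\d+(?:\.\d+)?)\s?(kg|g|km|m|cm|mm|l|ml)\b", re.IGNORECASE
)


def extract_features(text):
    """Return the e-mail addresses, phone numbers and quantities found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [m.group() for m in PHONE_RE.finditer(text)],
        "quantities": [
            (float(value), unit.lower())
            for value, unit in QUANTITY_RE.findall(text)
        ],
    }


if __name__ == "__main__":
    note = "Call +1 (555) 010-9999 or write to help@example.com about the 2.5 kg parcel."
    print(extract_features(note))
```

Each pattern is intentionally narrow, so that it is easy to see what it matches and where it would fail on messier input.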

Current text analytics solutions are indeed a valuable step forward. However, there are still many problems when dealing with unstructured data. For this reason, results are often disappointing, and substantial adjustments and many heuristics are often needed to make sense of the data.

How genre can help

The concept of genre (a.k.a. text type, type of text, text typology, type of document, etc.) is widely acknowledged in many disciplines (Corpus Linguistics, Computational Linguistics, Language Technology, NLP, Information Retrieval, Information Extraction, Sentiment Analysis, etc.) because information is organized differently in different types of texts. Essentially, the genre of a document has a bearing on the identification of relevant content. For instance, short texts like FAQs contain short sentences and simple syntax, so it is easy to extract the main information. Chats might be full of ellipses or typos and rely on the situational context, so they can be more difficult to handle. Longer texts like emails follow a more formal content organization, where the “core content” might be preceded by salutations, a short introduction and/or additional elements. FAQs, chats and emails are different genres with their own specificities that need to be taken into account when processing unstructured data.
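To make the point concrete, here is a minimal Python sketch of genre-aware preprocessing: each document carries a genre label, and that label selects how the “core content” is isolated. Everything in it is a hypothetical illustration, including the `genre` field, the function names and the very crude salutation and sign-off patterns; it is not a description of any existing tool.

```python
import re

# Crude, illustrative cues for email openings and closings (assumptions).
SALUTATION_RE = re.compile(r"^(hi|hello|dear)\b.*$", re.IGNORECASE)
SIGN_OFF_RE = re.compile(r"^(best wishes|best regards|regards|cheers)\b.*$", re.IGNORECASE)


def core_of_email(text):
    """Drop a leading salutation and everything from the sign-off onwards."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and SALUTATION_RE.match(lines[0]):
        lines = lines[1:]
    for i, line in enumerate(lines):
        if SIGN_OFF_RE.match(line):
            lines = lines[:i]
            break
    return " ".join(lines)


def core_of_chat(text):
    """Chats have no formal frame here; only normalize whitespace."""
    return " ".join(text.split())


# Hypothetical mapping from genre metadata to a preprocessing routine.
PREPROCESSORS = {"email": core_of_email, "chat": core_of_chat}


def core_content(document):
    """Pick the preprocessing routine according to the document's genre label."""
    handler = PREPROCESSORS.get(document["genre"], str.strip)
    return handler(document["text"])


if __name__ == "__main__":
    email = {
        "genre": "email",
        "text": "Hi Paul,\n\nThe invoice is attached.\nPlease confirm receipt.\n\nBest wishes\nMarina",
    }
    print(core_content(email))
```

The design choice worth noting is the lookup table: adding support for a new genre means adding one function and one entry, without touching the code that processes the corpus.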

A couple of genre-aware solutions for text analytics

1)      Data Collection and Preparation (a.k.a. Corpus Construction, Corpus Cleaning, and Representativeness): common technical challenges include the development of methods that streamline data management and reuse (cleaning, documentation, standardization). Additional techniques should be worked out for corpus profiling. A primary requirement that is often missing is the addition of genre metadata. The genre of a document affects the choice of the methodology to be applied for the identification and extraction of content, topic, sentiment, emotion, etc. For example, if you want to carry out topic discovery on a small corpus of short genres (say from 10 to 1000 words), such as FAQs, chats, microblog posts, text messages or newswires, you could apply a straightforward approach like the following: remove all the stop words (a.k.a. function words) and compute a Zipf-style n-gram (bigram or trigram) frequency list, sorted by descending frequency. The most frequent and meaningful combinations of content words will be placed at the top of the list. Presumably the top combinations represent the most important topics. Generally, this approach works well for languages like English. There is no need to apply any probabilistic model, and no need to measure the distribution and decide between a parametric and a non-parametric model. Basically, some simple text statistics will do the work. Conversely, if you have big data and complex genres, such as email threads, blog posts with related comments, or huge amounts of unconnected tweets, characterized by articulated syntax and complex discourse structure (e.g. see the emails included in the Enron email dataset), completely different decisions must be made to reliably carry out topic discovery.
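The simple approach described for short genres can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hand-made sample (a real one would be much longer), tokenization is a plain regular expression suited to English, and the names `content_words` and `top_ngrams` are invented for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it", "my", "i", "for",
    "on", "how", "do", "can", "was", "be", "with", "by", "after",
}


def content_words(text):
    """Lowercase, tokenize and drop stop words (function words)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def top_ngrams(documents, n=2, k=10):
    """Return the k most frequent n-grams of content words, most frequent first."""
    counts = Counter()
    for doc in documents:
        words = content_words(doc)
        # Slide a window of size n over each document separately,
        # so that no n-gram spans two documents.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts.most_common(k)


if __name__ == "__main__":
    faqs = [
        "How do I reset my password?",
        "Can I reset my password by email?",
        "How do I change my delivery address?",
        "Is it possible to change the delivery address after shipping?",
    ]
    for ngram, freq in top_ngrams(faqs, n=2, k=3):
        print(freq, " ".join(ngram))
```

Because stop words are removed before the n-grams are formed, a bigram such as “reset password” is counted even when the original text reads “reset my password”, which is what places combinations of content words at the top of the list.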

2)      Understanding the community structure and communication context: one of the most important novelties brought by the advent of digital communication, the web and social networks is the idea of customers as individuals who belong to communities. A single individual may belong to multiple communities, and those communities may even partially overlap. The concept of genre provides contextual information relevant to the identification of communities. Techniques should be worked out for social network analysis and social behaviour analysis through multivariate statistics, e.g. Principal Component Analysis and Hierarchical Cluster Analysis (e.g. see Paolillo et al., 2010).

The concept of genre can add effectiveness and boost the synergy between text analytics and the information technology industry in the effort to tame unstructured data.


1 comment for “White Paper: Text Analytics and Genre Awareness to the rescue of BI & CEM”

  1. 5 March, 2012 at 07:29

    Follow the discussion on LinkedIn: The WebGenre R&D Group (http://lnkd.in/986VBM)


    Paul D’Souza • Awesome! I did not have that perspective, thanks for sharing. Tell you what, this subject is catching on. We are actively working with Information Builders to do exactly this (and have a solution): linking unstructured data from all social media networks to structured enterprise data in order to learn more about customers, i.e. how they feel, think and act in reference to the business.

    The big question is this: bringing in the unstructured data is fine, but we have found that there is a need for one level of Data Driven Curation and then another phase of Human Driven Curation, so that the business can take action on what they learn from the outside.

    It is all about getting actionable information, and this last 100 yards of the last mile is where you need to tap into the social intelligence of your teams and let them Curate these newfound social media data feeds and Collaborate amongst themselves. We seem to have cracked this code and are currently working on this solution for a couple of Fortune 500 clients.

    Fun stuff!!

    Marina Santini • Hi Paul,

    thank you for your positive feedback. I appreciate your enthusiasm and your vision. I will have a look at Zakta and your Curation and Collaboration engine.

    Since I am collecting use cases, feel free to make additional suggestions or to post your thoughts and experiences with users.

    Best wishes
    Marina

    David Chamberlain • Marina,
    As a person who works in text mining and enterprise system and data integration I agree with your comment about the “meaning” of the data that builds the aggregated source data.
    The trick, or a trick that worked for our team, was to spend time with business users to construct taxonomies and other metadata to use on the front-end of the process AND we also included business users to determine what they wanted to get out of the system and the approach. The “what do you want out?”, is the same issue that I’ve used to drive my solution building for 25 years regardless of the tool, methodology, and necessity from the requirements.
    Some people now call this “Agile Development” but, in a strange twist, I always had a business user engaged on projects from end to end. Perhaps not as much as some people think is required, but it worked for us.
    The other thing we did was to build reusable tools, taxonomies, procedures, dictionaries, and modules, so that we minimized starting from “scratch” every time we took on a new product, disease, or customer/patient need that was not being answered or addressed.
    That would give us, the team members, time to dig and “discover” those nuggets of knowledge that we like to talk about. Sometimes we didn’t know what we knew or we didn’t know what we didn’t know. Hope that makes sense.

    That’s my 2-cents!
    David

    Marina Santini • Hi David,

    thanks a lot for your contribution. You pointed out and implemented some aspects that I think are inherently successful:
    1) the close interaction and communication with end users through:
    a) the acquisition of end user “vocabulary” (a.k.a. “user warrant”) and its implementation in taxonomies and metadata (i.e. stable resources)
    b) the effort to comprehend what end users want to take out from data to meet their needs (–> user satisfaction)
    2) the construction of reusable resources, which are designed to be flexible enough, general enough and granular enough to maximally capitalize on them for all kinds of customers, projects and new products (re-usability).

    I believe this is the ideal far-sighted approach to rewarding investments.

    Marina

    Marina Santini • However, I do not know, David, the kind of data and problems you are dealing with. I do not know if you have already worked out a user-centered genre taxonomy (that would be great!)

    In many scenarios the concept of genre can help contextualize information. For instance, search and retrieval systems experience the difficulty of matching a document to a query in the absence of contextual information about the searchers’ intents. Knowing the genre to which a text belongs leads to predictions concerning the form, function and context of communication. Hence, genre cues can contribute to information understanding and decision making.

    marina
