Actionable Corpus & Actionable Intelligence

I am trying to figure out how to predict future trends independently of entities.

For example, instead of trying to guess who will win the next American election (Obama and Romney are two entities), I would like to predict the trend representing Americans’ confidence in a better US economy in the years 2012-2017. This is just an example that simplifies my purpose, and it has nothing to do with my actual data.

I would like to start this exploration with predictive methods using the ENRON email dataset (http://www.cs.cmu.edu/~enron/).

I would like to predict – from this huge email corpus (UNSTRUCTURED BIG DATA) – whether and when (a point in the past) the ENRON SCANDAL could be expected to happen.

The ENRON email dataset will be the “actionable corpus” that will be used to experiment on non-entity-based predictions.

An actionable corpus contains unstructured actionable intelligence. Actionable intelligence refers to crucial insights derived from texts that can help make better decisions to avoid dramatic consequences, such as managers’ or stakeholders’ suicides, and similar. I am trying to think in terms of forensic linguistics in this case…

Do you know if similar experiments have already been carried out?

What computational approaches would you suggest for predictions and future trends of this kind?

All suggestions are welcome!

Marina

 

9 comments for “Actionable Corpus & Actionable Intelligence”

  1. 29 October, 2012 at 14:44

    Dear Marina,

    I can recommend Patricia Dunmire’s book (2011) on discourses of the future. Also work by Vyv Evans refers to cognitive affordances of Time in the construction of realities.
    I am doing research into party positions by analyzing texts for time and space references to find their political (ideologically motivated) deictic centre, from which imagined futures unfold.
    Hope this is useful.

    Kind regards,
    Bertie Kaal

    Reference: Dunmire, Patricia. 2011. Projecting the Future through Political Discourse: The case of the Bush doctrine. Amsterdam: Benjamins.

  2. 29 October, 2012 at 16:17

    I would interpret your question practically in terms of probability theory: the trend of a future event can be represented by its probability distribution. For example, what is the probability distribution of the future event “Obama or Romney wins”?

    I would build datasets from your data and build a probabilistic classifier. Then, I would get the resulting event distribution from the classifier. For example, I would obtain the mean and the variance of a normal distribution from the corresponding classifier.

    Alternatively, I would split the event in two, “Obama wins” and “Romney wins”, build the distribution of each event, and compare the resulting distributions, for example graphically, to tell what the trend is.

    Alexander
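
Editorial sketch: one way to read the suggestion above is a small naive Bayes classifier over bag-of-words counts, whose predicted class probabilities are then summarised by a mean and a variance. Everything specific here is an assumption made for illustration: the labels "confident" and "worried", the toy training texts, and the choice of naive Bayes itself (the comment does not name a classifier).

```python
import math
from collections import Counter
from statistics import mean, pvariance


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately naive)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    total = sum(class_counts.values())
    log_priors = {c: math.log(n / total) for c, n in class_counts.items()}
    return log_priors, word_counts, vocab


def predict_proba(model, text):
    """Return {label: probability} for one text, with Laplace smoothing."""
    log_priors, word_counts, vocab = model
    scores = {}
    for label, log_prior in log_priors.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = log_prior
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        scores[label] = score
    # Normalise log scores into probabilities (log-sum-exp trick).
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}


# Hypothetical toy data, not taken from any real corpus.
training = [
    ("economy will improve jobs growing", "confident"),
    ("strong growth and rising jobs", "confident"),
    ("recession fears and layoffs", "worried"),
    ("debt crisis and falling markets", "worried"),
]
new_docs = [
    "jobs growth looks strong",
    "layoffs and crisis ahead",
    "markets falling but jobs rising",
]

model = train_naive_bayes(training)
probs = [predict_proba(model, doc)["confident"] for doc in new_docs]

# Summarise the per-document probabilities as a mean and a variance.
trend_mean = mean(probs)
trend_var = pvariance(probs)
print(f"mean={trend_mean:.3f} variance={trend_var:.3f}")
```

Repeating the summary over successive time windows of a corpus would give one mean and variance per window, which is one possible way to draw the trend graphically.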

    • 30 October, 2012 at 14:18

      Hi Alexander,

      your classifier would probably work if my problem were to guess WHO is going to win the next American election.

      The trend that I would like to detect is instead independent of the winner of the presidency…

      • 30 October, 2012 at 16:43

        Hi Marina,

        my apologies if I insist on my solution: I don’t see a semantic connection between the features I extract in my classifier and the outcome of the classification. You can extract, for instance, stopwords to classify a text according to the trend representing Americans’ confidence in a better US economy in the years 2012-2017.

        I studied this connection in my PhD on opinion mining. I extracted not only bag-of-words features that are typically used in opinion mining but also other features, such as stopwords or stylometric features, that don’t have any emotional meaning. The result was: classification results were not much better.

        You might be interested in my contribution that addresses this issue:

        Osherenko, A., & André, E. (2007). Lexical affect sensing: are affect dictionaries necessary to analyze affect?. Second International Conference on Affective Computing and Intelligent Interaction, ACII 2007: Lecture Notes in Computer Science (pp. 230-241). Springer. http://www.springerlink.com/content/d136n463k1x5t555/

        Alexander
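
Editorial sketch: the stopword features mentioned above could be computed as relative frequencies, one value per stopword, giving a fixed-length vector that any classifier can consume. The stopword list and the function name `stopword_features` are hypothetical; the comment does not say which list or weighting was used.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = ("the", "and", "of", "to", "in")


def stopword_features(text, stopwords=STOPWORDS):
    """Return the relative frequency of each stopword in the text."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * len(stopwords)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in stopwords]


features = stopword_features("The cat and the dog went to the park")
print(features)
```

The vector carries no topical or emotional content by construction, which is the point of the comparison with bag-of-words features.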

  3. 30 October, 2012 at 13:04

    I have run into the following papers that I think are interesting and relevant:

    http://cs.stanford.edu/people/jure/pubs/quotes-kdd09.pdf
    http://cs.stanford.edu/people/jure/pubs/memeshapes-wsdm11.pdf

    Best,
    Periklis Andritsos
    University of Toronto
    Faculty of Information

  4. Gary Berosik
    30 October, 2012 at 16:42

    Just a couple of reactions and thoughts, Marina:

    – Your directions appear to be related more or less closely to Black Swan Theory (http://en.wikipedia.org/wiki/Black_swan_theory) and research approaches to predicting unexpected large risks. If you have not already done so, you might want to consider some of that predictive research work.

    – Your stated challenge is to predict whether and when (a point in the past) the ENRON SCANDAL could be expected to happen. The cynic in me suggests that the answer to “whether” a scandal will happen within ANY large organization is ALWAYS “yes”. The real challenge is predictively determining “when”, in the data’s “future”, such events might occur with certain levels of statistical confidence. Feature engineering is always problematic in predictive modeling efforts such as this. I don’t have any suggestions for the ENRON data, though – sorry! 🙂

  5. 31 October, 2012 at 15:44

    Thanks Gary for your suggestions!

    @Alexander, I will get back to you soon…

  6. John McGrath
    14 November, 2012 at 23:06

    The analogy I would use is more like predicting a tornado. No two are exactly the same, but there are certain environmental characteristics that, when combined in numerous ways, predict a high likelihood of a tornado. In your case the SCANDAL threshold would be determined by a graph of BoW scores defining the required characteristics of a SCANDAL. Reference the scores to a timeline. Where the scores exceed the threshold, you have a higher probability of finding the origin of events, key events, smoking gun, etc. without entity analytics. Determining the SCANDAL graph is usually based on historical analysis of similar scandals, but then you always have a black swan to worry about. 🙂 Good luck.
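
Editorial sketch: the threshold idea above might look like the following, where each message gets a weighted bag-of-words score, scores are averaged per month, and months above a threshold are flagged. The term weights, the threshold of 0.5, and the four messages are all invented for illustration; in practice the weights would come from the historical analysis of similar scandals that the comment describes.

```python
from collections import defaultdict

# Hypothetical term weights standing in for a "SCANDAL" profile.
SCANDAL_TERMS = {
    "shred": 3.0,
    "restate": 2.5,
    "losses": 2.0,
    "offshore": 2.0,
    "audit": 1.0,
}


def bow_score(text, weights):
    """Sum of term weights divided by message length in tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)


def monthly_scores(messages, weights):
    """Average score per month; dates are 'YYYY-MM-DD' strings."""
    buckets = defaultdict(list)
    for date, text in messages:
        buckets[date[:7]].append(bow_score(text, weights))
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


def months_over_threshold(scores, threshold):
    """Months whose average score exceeds the threshold, in time order."""
    return [month for month, score in scores.items() if score > threshold]


# Invented messages, not taken from the ENRON dataset.
messages = [
    ("2001-06-03", "lunch meeting moved to friday"),
    ("2001-06-17", "quarterly audit schedule attached"),
    ("2001-09-05", "we must restate losses before audit"),
    ("2001-09-20", "shred the offshore files today"),
]

scores = monthly_scores(messages, SCANDAL_TERMS)
flagged = months_over_threshold(scores, threshold=0.5)
print(scores)
print(flagged)
```

No sender, recipient, or named person enters the score, so the flagged periods are found without entity analytics, as the comment proposes.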

  7. 16 November, 2012 at 17:13

    Ok John. I will keep u informed about the trends 🙂
