Summary: Where is the future? From big data to contextualized information

Comments on the post: The Path Forward: From Big Unstructured Data to Contextualized Information (www.forum.santini.se/2012/03/the-path-forward-from-big-unstructured-data-to-contextualized-information/)

Discussion on LinkedIn: American Society for Information Science & Technology (http://lnkd.in/EdqJDb)

Tom Reamy • Hi Marina, good blog – and as someone dealing with the idea of context in text analytics for many years, I’m in total agreement as to its importance. There are quite a few other types of context that are important as well. Another conversation.
As far as text analytics tools dealing with this – most of them can but the ones with a full set of operators will probably do best. Two contextual areas come to mind immediately – how to get TA software to recognize context like genre when it is not specified and how to take context into account in categorization or extraction rules.
The former can be done with both statistical and rule-based approaches, but the latter is something you really need rules for.
As to who does it, there is a whole confusion of overlapping offerings (one of the services my company has been doing a lot of lately is helping companies run text analytics evaluation projects). The top commercial names are SAS and IBM on the full-platform end; Smart Logic, Expert Systems, and Concept Searching, among others, in the mid-tier; and then the sentiment analysis companies do some of this – Attensity, Clarabridge, Lexalytics – plus lots of little companies like Janya and AI-One. Lastly there is open source like GATE. There are many more, but in the interest of my typing fatigue, let’s stop there.

The real trick is not the technology but building rules or statistical models that are not too brittle yet have sufficient depth – but that’s another topic entirely.

Marina Santini • Hi Tom, thanks for your useful list. My experience with some of these products is that they need substantial tweaking to identify genre and sublanguage differences…
Or as you say they are still “too brittle” 🙂
Cheers, marina

Tom Reamy • Hi Marina,
I’m not sure what you mean by tweaking – certainly none of those packages could categorize genre out of the box, but they are all designed to support developing rules that should enable them to characterize genre as long as there are indicators of genre in the text. I’ve mostly worked on topical categorization and entity extraction using the tools I listed, so I haven’t really looked into what features you might use to create a set of genre classification rules.
If you could provide some examples of genre classification done by humans, I might be able to give you a better answer as to what kind of effort it might require to develop some automated rules. Or if it is possible at all given current technologies. For example — What kinds of features would a human typically use to categorize genre? What are some examples of genre that you find particularly useful to distinguish?

As for the comment about “brittle”, I was referring to the generality of the rules you might build – it is all too easy to build rules that work with high precision on a particular set of documents but fail when applied to new documents.
Thanks,
Tom

Marina Santini • Hi Tom,
as for examples of genre classification done by humans, Mark Rosso worked a lot with this topic. Here are some references:
* Abstract: Identification of Web Genres by User Warrant (http://www.forum.santini.se/2011/03/abstract-identification-of-web-genres-by-user-warrant/)
* Using Genre to Improve Web Search (http://ils.unc.edu/~rossm/Rosso_dissertation.pdf)
* Mark A. Rosso: User-based identification of Web genres. JASIST 59(7): 1053-1072 (2008)

My own experiment with genre classification by humans is described here:
* Santini M. (2008). “Zero, Single, or Multi? Genres of Web Pages through the Users’ Perspective”. Information Processing & Management. Volume 44, Issue 2, March 2008, pp. 702–737.
When Mark and I tried to compare human classification and a state-of-the-art genre prototype, we got the following results:
* Testing a Genre-Enabled Application: A Preliminary Assessment (www.bcs.org/upload/pdf/ewic_fd08_paper7.pdf)

An interesting small experiment is briefly described here:
* Rehm G., Santini M., Mehler A., Braslavski P., Gleim R., Stubbe A., Symonenko S., Tavosanis M. and Vidulin V. (2008). “Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems”, LREC 2008. Marrakech. <http://www.astro-susi.de/genre/lrec2008.pdf>
* Case Study: Assigning genre labels to random web pages (http://www.webgenrewiki.org/index.php5/Case_Study:_Assigning_genre_labels_to_random_web_pages)
Your experience and insights will be valuable for me, especially because you have worked with topical text categorization. Many genre researchers claim that topic, genre and sublanguage are different concepts, although overlap exists among them. Presumably we need different features (and/or computational methods) to capture them. In my opinion, the great advantage of genre and sublanguage detection is that they can tell us something about the “context” where information has been uttered/issued/spread, thus reducing ambiguity and misunderstandings. In my experience most state-of-the-art packages (and approaches) are topic-oriented, which is not enough to account for genre and sublanguage.
Looking forward to hearing your opinion.
big thanx, marina

——————————————
Discussion on LinkedIn: BCS Information Retrieval Specialist Group (http://lnkd.in/7XNRgX)

Alexandro G.S. Mancusi • Could you approach this as a classification problem and use SVM? We have employed such methods to work on very large document sets. We work together with Treparel Information Solutions to handle similar challenges, but it helps to break the problem down further so you can maybe employ a mix of technologies.

Marina Santini • Hi Alexandro,
It is indeed a classification problem. In all the text analytic solutions that I have used/tried out so far, I have not seen any genre/sublanguage/domain classifications implemented. But of course I might have missed some of them. That’s why I would like to have additional suggestions. My claim is that we should add (at least) these three kinds of textual classifications to text analytic tools to assess the real value of information. In simple words, we need the “context” provided by these three concepts to value information. I would say “conceptualize rather than tweak”.
As far as algorithms are concerned, SVM is very powerful and performs well with genre classification on smallish corpora/datasets (say, 5,000 documents).
Could you tell me your results on genre/sublanguage/domain classifications? How many documents in your datasets? Which P/R? Which features?
Cheers, marina

Alexander Osherenko • As far as I know, it doesn’t matter much which classifier you use. It is not even very important which feature evaluation method you use. For example, you can use SVM (an analytical classifier); however, NaiveBayes (a probabilistic classifier) can outperform it on particular kinds of data. You can use the frequency feature evaluation method; however, it can be outperformed by presence-based feature evaluation.
I assume classification results depend rather on the features resulting from the nature of the information you classify, and not on the classifier or the feature evaluation method. In the case of text classification, I considered, for example, text length or the grammatical correctness of texts.
Concerning results, the recall and precision values in my experiments were in most cases very close, and I could establish a rule of thumb — they are about triple the chance baseline. For example, for a 5-class problem the R/P values were about 60% (20% by chance).

——————————————
Discussion on LinkedIn: Text Analytics (http://lnkd.in/N2XduH)

Ron Katriel • Hi Marina. If I understood you right, you are concerned with making sense of multiple information channels, each with its own unique structure. Automatic discovery of structure within a channel is a form of unsupervised learning which, as you probably already know, is a very difficult problem (also known as density estimation in statistics). It is related to artificial intelligence concepts such as data mining, clustering, feature extraction, and dimensionality reduction. You may also want to take a look at new work being done on Sentiment Analysis. It looks like it partly overlaps with your area of interest, even if the goals are somewhat different.

Marina Santini • Hi Ron,
I see it more as an automatic text classification problem. Genre, sublanguage, domain are TEXTUAL dimensions that we can use to understand (or infer) the context in which a document/text/conversation has been issued. Say that we want to automatically extract the most important information in a large set of telephone conversation transcripts from a call center. If we just extract the most frequent words/unigrams (with any kind of sophistication), in many cases this will not be enough and we will have to tweak a lot. In order to single out the most important information we need to know how human communication works in a certain context. Genre, sublanguage, domain (and other textual dimensions) help us to reconstruct the communicative context.
As I said to Alexandro (http://lnkd.in/7XNRgX), in my view we should add (at least) these three kinds of textual classifications to text analytic tools to assess the real value of information.
What do you think?
Cheers, marina

Ron Katriel • Hi Marina,
I can see why you would consider these a classification problem but to be able to treat them this way you need a “teacher” (i.e., a large training set that is already classified).  For example, while working at an electronic news integrator, I led an R&D effort to automatically classify (tag) news stories in real time.  The Genre included news feeds from vendors such as Business Wire, PR Newswire, Reuters, AP, etc.  Some of the sources were already tagged using a classification system that included business, sports, entertainment, weather, etc., concepts (Domains).
A key insight was that one can view a coded news channel as a teacher, using it to train a classifier to augment the coding of channels that are not coded or have less sophisticated coding systems, thus ensuring uniformity of coding across all channels. This works because, regardless of vendor, electronic news articles share many stylistic attributes, such as having the key information show up early in the article, and they have a similar style due to the common audience and editorial influence.
In the case of news stories, the word/phrase frequency approach had to be augmented to account for structure (headline vs. opening paragraph vs. body of article), word/phrase counts had to be normalized with respect to the length of the article (to obtain frequencies of occurrence), and special entities such as organizations, names, stock symbols, locations, currencies, etc., were automatically tagged to increase their effectiveness. The relative importance of such Sublanguage aspects had to be tuned via a trial and error approach since there was no analytical way of doing this.
I hope this helps.
Cheers,
Ron

Marina Santini • Hi Ron,
what an interesting answer! Yes, Ron, your answer helps me understand better how the concepts of genre and/or sublanguage are dealt with.
What you call “a key insight” is basically “genre awareness”. Your “stylistic attributes” (aka “genre features”) are normally employed for AGI (Automatic Genre Identification).
Did you use a vector representation? How did you account for the compositional structure, i.e. title, opening paragraphs, etc.? With weights? How big was your training corpus? And the test set? Do you have any paper describing your experience?
An interesting experiment that might be of interest is summarized here: http://www.forum.santini.se/2011/03/abstract-formulating-representative-features-with-respect-to-genre-classification/
You say there is no analytical way to account for genre and sublanguage… I am arguing the opposite 🙂 Once you conceptualize them, the door towards the future is open…
What do you think?
Cheers, Marina

Ron Katriel • Marina,
This work dates back to the mid-90s, well before genre classification became a prominent subject of research. But, ultimately, it all ties back to the underlying statistical concepts of probabilities, priors, distributions, stratification, etc., so it is no surprise the same fundamental concepts come up again and again under different names.
Our approach was indeed vector based. We typically used between 100 and 1,000 features (words and phrases) per class, depending on the size of our training sample. We had hundreds of thousands of tagged news stories but, as expected (i.e., Zipf’s Law), a few categories were over-represented while the majority of them occurred less frequently.
The learning system depended on two key innovations: a clever feature selection algorithm (ranking words and phrases based on their ability to separate two distributions – news stories tagged as belonging to the class of interest and the rest which do not) and a high dimensionality sparse matrix linear equation solver. After several years of tuning the code we achieved recall and precision equaling or exceeding those of human coders.
As you guessed, structure was accounted for using weights. The primary components were headline (somewhat important), opening paragraphs (very important) and the rest of the news story (less important). The weightings were arrived at empirically by trial and error but the overall performance was not highly dependent on getting this perfectly right. This is definitely a research area which could lead to improvements in performance.
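Purely as a generic illustration of this kind of section weighting (a hypothetical Java sketch, not the actual system described above, and with invented section weights and text), the idea can be expressed roughly like this:

    import java.util.HashMap;
    import java.util.Map;

    public class SectionWeightedTerms {

        // Add length-normalized, section-weighted counts for every token in a section.
        static void addTerms(Map<String, Double> scores, String text, double weight) {
            String[] tokens = text.toLowerCase().split("\\W+");
            for (String tok : tokens) {
                if (tok.isEmpty()) continue;
                // divide by the section length so long sections do not dominate
                scores.merge(tok, weight / tokens.length, Double::sum);
            }
        }

        public static void main(String[] args) {
            Map<String, Double> scores = new HashMap<>();
            // the section weights below are invented for illustration only
            addTerms(scores, "Acme Corp posts record quarterly profit", 1.0);          // headline: somewhat important
            addTerms(scores, "Acme Corp announced a record profit on Tuesday.", 2.0);  // opening paragraph: very important
            addTerms(scores, "The company also said it would expand overseas.", 0.5);  // rest of the story: less important
            scores.forEach((term, score) -> System.out.printf("%-10s %.3f%n", term, score));
        }
    }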
Thanks for the link to the article. Would you be able to email me the PDF version? As far as I can recall, we never published our results in a mainstream journal or conference proceeding but I have copies of internal documents on the subject which I may be able to share with you after checking with the company I worked for (the system is still used as a major revenue generator so there may be concerns about sharing detailed information).
Cheers,
Ron

——————————————
Discussion on LinkedIn: The WebGenre R&D Group (http://lnkd.in/8ucizZ)

Daniel McMichael • try Thoughtweb http://www.thoughtweb.com

Marina Santini • Thanx, Daniel.

Nitesh Khilwani • In Meshlabs, we try solving similar problems for different domains (media, insurance, IT, banking). http://meshlabsinc.com/.

Marina Santini • Thanks for the suggestion, Nitesh.
Marina

Srikant Jakilinki • @Nitesh I was under the impression that #Meshlabs is inactive. The demo video on your website has been defunct, and so has the copyright notice.

Nitesh Khilwani • Hi Srikant,
Thanks for pointing out this.
Apologies for this misunderstanding. We are restructuring our entire website and will be up and running soon.

Marina Santini • @Alexander: the machine learning algorithm makes a difference. I tried a Naive Bayes classifier and an SVM (within WEKA) on the same dataset and got quite a big, statistically significant difference.
Have a look at Table 1 in this paper:
Santini M. (2007). “Automatic Genre Identification: Towards a Flexible Classification Scheme”. BCS IRSG Symposium: Future Directions in Information Access 2007 (FDIA 2007) http://ewic.bcs.org/content/ConWebDoc/13731
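For anyone who wants to try this kind of comparison, a minimal sketch using the WEKA Java API might look like the following (the file name genre-dataset.arff is a placeholder, and the class label is assumed to be the last attribute):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class GenreNbVsSmo {
        public static void main(String[] args) throws Exception {
            // "genre-dataset.arff" is a placeholder; the class label is assumed to be the last attribute
            Instances data = DataSource.read("genre-dataset.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation for Naive Bayes
            Evaluation nbEval = new Evaluation(data);
            nbEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

            // 10-fold cross-validation for WEKA's SVM (SMO)
            Evaluation smoEval = new Evaluation(data);
            smoEval.crossValidateModel(new SMO(), data, 10, new Random(1));

            System.out.printf("NaiveBayes accuracy: %.2f%%%n", nbEval.pctCorrect());
            System.out.printf("SMO (SVM) accuracy:  %.2f%%%n", smoEval.pctCorrect());
        }
    }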

Alexander Osherenko • Of course, Marina, you are right — you probably mean the recognition rates for “Personal Home Pages” and “Listings”, where there is quite a big difference. Although I would be tempted to ask how your data was collected, whether you have sparse data, and what recall value your classification achieves as evidence that the classifier actually learnt the classification, that is not the point.

In my post I was trying to answer the question: how can I improve classification results for a particular corpus? Which means can be considered beneficial in this respect?

A popular answer would be: choose an appropriate classifier. However, the list of “usual suspects” in text mining is quite short — you would probably choose SVM or NaiveBayes. Maybe you would consider ensemble classifiers; however, in my personal experience, classification results do not become much better and such classification is time-consuming. After trying these classifiers you can try other classifiers, or you can try other data mining means such as the method of feature evaluation. However, my findings did not confirm that classification results can be significantly improved by such choices. Modern SVM and NaiveBayes are good enough, and I would not waste time looking for an even better classifier.

That’s why my answer to the improvement question would be: a significant improvement in recognition can be achieved by considering not only data mining issues but, more effectively, by extracting other features. Such features in text mining can be lexical, stylometric, deictic, grammatical, or combinations of these.

I assume that shifting the focus to feature extraction would improve classification more than only trying to answer data mining questions such as which classifier is best, which method of feature evaluation is best, and so on.

Leonid Boytsov • I also suspect that features are more important than the classifier. I have a lot of anecdotal evidence supporting the hypothesis that simple methods work almost as well as sophisticated ones. In particular, during an online Stanford NLP course lecture, it was highlighted that Naive Bayes is often as good as the other methods. The difference between a simpler and a more sophisticated method is typically within 10%. Hence, I am quite surprised to see that a (supposedly linear) SVM outperformed Naive Bayes by such a wide margin.

A good discussion on this issue can be found here: http://arxiv.org/pdf/math/0606441.pdf

Marina Santini • Hi Leonid,

I know what you mean. I am also following the online Stanford NLP course and I was quite surprised to hear that standard Naive Bayes classifiers are basically suitable for everything.
In my experience, and with my features, SVM always performs much better.

Thanks for the reference paper. That’s useful.

Cheers, Marina

Alexander Osherenko • I would say the classifier choice depends on the amount of learning data. If your classifier is learning on sparse data (or similar), it is more beneficial to use NaiveBayes as a probabilistic approach.

Otherwise, if you have lots of data, SVM (as an analytical approach) is more beneficial. It learns better, but you must consider overfitting and thus use generalization.

Marina Santini • Hi Alexander,

as far as I know, with support vectors overfitting is unlikely to occur. I had small datasets and sparse data. In that experiment, my features were morphological and syntactic features (ftp://ftp.itri.bton.ac.uk/reports/ITRI-05-02.pdf) extracted from web pages.

In a previous experiment, I got very good results with a Naive Bayes classifier, with different features, i.e. POS trigrams and small datasets extracted from the British National Corpus (http://www.cs.bham.ac.uk/~mgl/cluk/papers/santini.pdf).

I would say that there are always three factors to take into account in text classification: documents, features, classifier.

Cheers, Marina

Leonid Boytsov • Hi Marina,

Please note that the NLP lectures were just an example. I also googled the issue of Naive Bayes vs. more sophisticated methods. I have not found a single paper claiming a huge difference, e.g., between SVM and NB, or SVM and decision trees. Here is a typical example:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.751&rep=rep1&type=pdf

In my experience, a 10% rule holds not only in classification, but also for other information processing tasks, in particular in ad hoc search. It is very hard to beat a simple method by a wide margin consistently. Remember the Netflix competition, which can also be considered a classification problem.

Alexander Osherenko • Hi Marina,

you acknowledge what I said: if you have small datasets and sparse data you should take NaiveBayes. Otherwise, you should use SVM. 🙂

You don’t have much more — documents, features, classifier.

Cheers
Alexander

Marina Santini • Dear Alexander and Leonid,

It might be that genre classification is more unpredictable than classification with standard datasets. That’s why we need more research in this field 🙂 If you wish, you can use my datasets and carry out a comparison between Naive Bayes and SVM. It would be interesting to see your results. I used WEKA’s implementations of Naive Bayes and SVM. What do you think?

@Alexander: I had small datasets in both experiments. SVM was run on only 1,400 web pages divided into 7 classes. I would not say this is a big dataset.

Marina

Alexander Osherenko • Marina,

I would be delighted to run WEKA.

Alexander

Leonid Boytsov • I’d be happy to watch for an outcome. 🙂

Leonid Boytsov • leo at boytsov.info

Leonid Boytsov • Please, note that I am not saying that this is impossible. I just could not find any hardcore evidence that sophisticated methods are MUCH better. It would be quite interesting though to find a class of datasets when this is the case.

Alexander Osherenko • Hi Marina,

regarding our discussion about overfitting and SVM, I reread the definition of overfitting by Mitchell in “Machine Learning”, 1997, p. 67. He defines overfitting in terms of the search space and an error threshold and does not mention classifiers. Accordingly, overfitting can occur with every classifier, and generalization can ease the corresponding problems.

If we consider NaiveBayes and SVM, we see that NaiveBayes unfortunately does not have the necessary parameters to boost generalization, in contrast to SVM.

Cheers, Alexander

Marina Santini • Hi Alexander.

yes I understand…

Since I used weka, I also used the data mining book associated with it (http://mestrado.deinfo.uepg.br/mestrado/docs/WittenFrank.pdf).

About overfitting of support vectors, it says (pp. 217-218): “With support vectors, overfitting is unlikely to occur. The reason is that it is inevitably associated with instability: changing one or two instance vectors will make sweeping changes to large sections of the decision boundary. But the maximum margin hyperplane is relatively stable: it only moves if training instances are added or deleted that are support vectors—and this is true even in the high-dimensional space spanned by the nonlinear transformation. Overfitting is caused by too much flexibility in the decision boundary. The support vectors are global representatives of the whole set of training points, and there are usually few of them, which gives little flexibility. Thus overfitting is unlikely to occur.”

Let’s see what happens with your own runs on the genre dataset… I am very curious…

Cheers, Marina

Alexander Osherenko • Hi Marina,

astonishing. I have performed classification using SMO and NaiveBayes in Weka 3.6. The results are as you describe: SMO is significantly better than NaiveBayes (89% vs. 67.14% averaged recall). I composed a 2-class dataset consisting of 400 instances of your dataset by keeping only the first 400 rows. On the 400-instance dataset I get 99% vs. 96.25%. Moreover, I removed the first 50 features from your dataset, leaving the remaining 50 — 67.07% vs. 56.57%.

For comparison, I also classified my 3-class dataset with 765 instances. This dataset contains instances for affect sensing in short texts, with extracted features relying on bag-of-words, deictic features, etc. There I get 36.96% for SMO vs. 40.58% for NaiveBayes.

That’s why my explanation lies in the method of feature extraction: hand-picked vs. automatic. Evidently, in your dataset you carefully extracted features such as third_person or expressiveness. Hence, the pattern of classification results does not change whether the dataset contains 1400 vs. 400 instances or 100 vs. 50 features — SMO is always better than NaiveBayes in this case. In my dataset, features are extracted automatically based on their frequency in the corpus and are not hand-picked. Thus, NaiveBayes is better than SMO.

I’ll tell you if I find more.

Cheers, Alexander

Leonid Boytsov • Hi All,

I have not been able to reproduce Marina’s results, because the NB classifier works better in my scenario. Linear SVM has equivalent performance. I also tried a non-linear SVM, which was better by 6%. I also experimented with multi-label classification. In this case, the NON-LINEAR SVM was 16% better, but this is likely due to overfitting. Anyway, I believe that a non-linear SVM should be better than NB by about 10%, but not more.

See a description below.

I did not rely on cross-validation directly, because a lot can be achieved by chance on such a small dataset with so many features. Instead, I employed bootstrapping (90% of the points are randomly selected for training and the rest for validation). Results were averaged over the bootstrapped series. I considered two cases: a series of binary classifiers, which was apparently Marina’s choice (blog vs. non-blog, home page vs. non-home page, etc.), and then a multi-label classification.

In the binary case, NB had 86% accuracy, the linear SVM had 86% accuracy, and the NON-LINEAR SVM had 92% accuracy.
In the multi-label case, NB had 60% accuracy, the linear SVM had 62% accuracy, and the NON-LINEAR SVM had 70% accuracy.

Note, however, that over-fitting with the NON-LINEAR SVM is most certainly happening here: I did not have a separate training set and simply selected the kernel parameter that yielded the highest accuracy.

I can make my scripts available if anybody wants to reproduce my results. One would need a Unix machine with a C compiler and Perl for this task.
A spread-sheet with results is available here:
https://docs.google.com/a/boytsov.info/spreadsheet/ccc?key=0AveqTL0qdHxMdDhBdzdSa3Y4YzhCSUFFYTVTdXBUY0E#gid=0
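A minimal WEKA-based sketch of this kind of repeated random 90/10 evaluation (an illustration only, not the actual C/Perl scripts mentioned above; the file name and number of runs are placeholders) could look like this:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedSplitEval {
        public static void main(String[] args) throws Exception {
            // placeholder file name; class label assumed to be the last attribute
            Instances data = DataSource.read("genre-dataset.arff");
            data.setClassIndex(data.numAttributes() - 1);

            int runs = 30;            // number of random 90/10 splits (illustrative)
            double sumAccuracy = 0;
            for (int r = 0; r < runs; r++) {
                Instances shuffled = new Instances(data);
                shuffled.randomize(new Random(r));
                int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test = new Instances(shuffled, trainSize, shuffled.numInstances() - trainSize);

                Classifier clf = new NaiveBayes();   // swap in new SMO() etc. to compare classifiers
                clf.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(clf, test);
                sumAccuracy += eval.pctCorrect();
            }
            System.out.printf("Mean accuracy over %d random 90/10 splits: %.2f%%%n", runs, sumAccuracy / runs);
        }
    }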

Marina Santini • Hi Leonid,

thanks a lot for your evaluation. It seems to me that the performance of a specific classifier depends very much on the way it has been implemented.
Did you write your own classifiers or did you use a standard statistical package?
By “multi-label” do you mean “multi-class”? In both cases (NB and SVM) I applied multi-class classification, so apparently WEKA’s accuracies are higher than yours. By the way, while Weka’s NB has a standard probabilistic implementation, Weka’s SVMs are trained with SMO (sequential minimal optimization), using polynomial or Gaussian kernels. The full description of the implementation is on page 410 of this book: http://mestrado.deinfo.uepg.br/mestrado/docs/WittenFrank.pdf.
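As an aside, here is a minimal, hypothetical sketch of how WEKA’s SMO can be configured with a polynomial or a Gaussian (RBF) kernel; the parameter values are purely illustrative:

    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.PolyKernel;
    import weka.classifiers.functions.supportVector.RBFKernel;

    public class SmoKernels {
        public static void main(String[] args) {
            // SMO with a polynomial kernel (WEKA's default kernel family for SMO)
            SMO polySvm = new SMO();
            PolyKernel poly = new PolyKernel();
            poly.setExponent(2.0);          // illustrative value
            polySvm.setKernel(poly);

            // SMO with a Gaussian (RBF) kernel
            SMO rbfSvm = new SMO();
            RBFKernel rbf = new RBFKernel();
            rbf.setGamma(0.01);             // illustrative value
            rbfSvm.setKernel(rbf);

            // Either classifier is then trained with buildClassifier(Instances) as usual.
        }
    }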

I agree with you, this dataset is too small and too balanced to make any conclusive statement.
But this is the state of the art of automatic genre classification: small genre collections 🙁
If you hear about a large genre corpus, let me know.

I made the genre dataset available online

(https://docs.google.com/spreadsheet/ccc?key=0AmOc7K2vrA_8dGUtcnV0TDRJU0syQjB6Z1FQc194a3c#gid=0)

so that it can also be evaluated by other people.

Thanks again for your time and for sharing your experience.

Cheers, Marina


Alexander Osherenko • Hi Marina and Leonid,

you are talking about data mining issues. You discuss NB or SVM or what kernels to use.

These are very important issues. No question — it is an acknowledged truth that the most reliable results can be obtained using these two classifiers. However, in my initial post I wanted to discuss another side of classification: the semantic meaning of extracted features.

In Marina’s dataset (thanks!), I see many features that resemble those in my own datasets — grammatical, stylometric, or deictic features that I used for opinion mining. Although I am not sure what the reason was for extracting these particular features, they obviously contribute to correct classification and I want to understand why. This has a very practical reason: if I develop a believable classification system, for example a genre classification system, I have to consider such semantic issues.

Cheers, Alexander

Leonid Boytsov • Hi Marina,

How do you compute accuracies in the multi-class case?

Leonid Boytsov • Alexander, of course feature engineering is primary, at least in my opinion. Yet, the ML algorithm may also make a difference.

Marina Santini • @Leonid: Multi-class problems are solved using pairwise classification. Have a look here: http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html

@Alexander: if you wish, you can also try “light” genre-revealing features (a small, hypothetical character n-gram sketch follows the references below). Have a look at these three papers:
* http://www.forum.santini.se/2011/03/abstract-formulating-representative-features-with-respect-to-genre-classification/
* Kanaris, I. and Stamatatos, E. (2009). Learning to recognize webpage genres. Information Processing and Management, 45:499–512.

* Mason J., Shepherd M. and Duffy J. (2009). “An N-Gram Based Approach to Automatically Identifying Web Page Genre”. 42nd Hawaii International Conference on System Sciences, 2009.
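As a rough, hypothetical illustration of such “light” character n-gram features (a sketch, not the implementation used in the papers above), one could count overlapping character n-grams like this:

    import java.util.HashMap;
    import java.util.Map;

    public class CharNGrams {

        // Count overlapping character n-grams; the counts (or just their presence)
        // can then serve as "light" features for a genre classifier.
        static Map<String, Integer> charNGrams(String text, int n) {
            Map<String, Integer> counts = new HashMap<>();
            String s = text.toLowerCase();
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(charNGrams("Welcome to my personal home page!", 3));
        }
    }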

Cheers, Marina

Leonid Boytsov • Marina,
Thank you for the explanation, but the question still remains. WEKA computes weights, not accuracies. In your paper, the binary classifier error is essentially equal to the multi-class classifier error. This is possible, but is *VERY* unlikely.

Alexander Osherenko • @Marina. Thank you for your papers. If you have another description of your work and of the features you extracted in your dataset, especially their semantic interpretation, I would be happy to read it.

Although we are working on different tasks (you on genre identification and I on opinion mining), I think the semantic conclusions on the utilized features can be the same. I have already started my work and it seems I can apply your conclusions to my work (Alexander Osherenko. Considering Impact of Sociolinguistic Findings in Believable Opinion Mining Systems. Proceedings of The Fifth International Conference On Cognitive Science. http://www.informatik.uni-augsburg.de/~osherenk/final_kalinigrad.pdf).

Marina Santini • @Leonid: if you wish more technical details, maybe you can contact the WEKA team about the implementation of their SVM. They have a very proactive forum/mailing list. One idea is that we send the genre dataset to the WEKA development team and ask for their comments/feedback on the performance of NB and SVM…

@Alexander: in this experiment (Cross-Testing a Genre Classification Model for the Web: http://www.forum.santini.se/2011/03/170/) I used the same features with a different classifier (a probabilistic model with weights)…

——————————————

Discussion on LinkedIn: The Semantic Web Analytics Group (http://lnkd.in/AzHsqr)

David Dodds • I would like to add the following discussion to Marina’s discussion: compare and contrast (the process of) confabulation with that of Summary / Contextualized Information. An example of confabulation is using that idea to produce a (decent-quality and accurate) natural language characterization/description of the states and activities in a neural network or genetic algorithm.

Marina Santini • Hi David,

can you give examples of “confabulation”? Do you mean “conversation”, “dialogue”, or something more psychological? In which field(s)? Could you expand a little your comment?

Cheers, Marina

David Dodds • Hi Marina,

I did not mean (in my posting) “conversation”, nor “dialogue”. But now that you mention it, the ideas I’m thinking about can be expanded such that human feedback on the machine-generated confabulated natural language depicting the activity/meaning/processing within, say, artificial (or even real) neural networks and/or genetic algorithms could be delivered back to the machinery which did the confabulating, and that machinery could perform actions associated with the content of the human feedback.
(That would constitute a kind of conversation, but not one like between humans.)
One possibility is that perhaps the algorithms and/or data for/in the neural networks and/or genetic algorithms could be adjusted (based on the feedback to the confabulated description of their activity). One example might be a means of providing ‘hints’ from the human to the neural networks and/or genetic algorithms, and by hints I absolutely do not mean any communication of human-written direct computer programming changes. (I suppose if one really had to for some reason one might call it a ‘monologue’ from the machine-generated confabulated natural language producer, resulting from its processing of the data it received from the states and processing/operation of the neural networks and/or genetic algorithms it was monitoring.)

One serious and important difference between this machine-generated ‘dialogue’ and one generated by a human blethering on, speaking stream of consciousness, is that the machine version (apart from likely having dreadfully unskilled grammar) has NO BENEFIT FROM ANY KIND OF AWARENESS (aka ‘situated’ awareness, sometimes called a ‘self’). (I don’t like using the term consciousness; it is too often abused and used incorrectly.)

Consider digital fingerprint processing, face recognition (algorithms), signal recognition and categorization. An EEG heart monitor is analog: doctors can watch the waveform and recognize the meaning in real time. It might be called signal recognition: tachycardia, arrhythmia, etc. Confabulation of EEG signals would be able to perform such categorizations and also present natural language depicting the activity/meaning. Maybe English medical terminology such as tachycardia or arrhythmia could be displayed or spoken as synthesized speech (in real time) as a result of a person lying on an operating table whose EEG signal depicts those things about the patient’s heart. This is signal categorization, but not really understanding it, just recognizing it. This is the difference between being able to see the characters in a Chinese newspaper and being able to read the same newspaper.

Another view is that it is a sort of speaking of the train of consciousness, just saying out loud what one perceives in one’s mind as it happens. This is not at all a good analogy, though, because the mind is highly qualified at generating natural language, operating one’s mouth, and translating almost any of one’s thoughts or mentations into words/natural language. There are presently no machines or computer algorithms (software) which begin to hold a candle to the human brain’s capability to do these things. Confabulation may be considered those brain capabilities just mentioned, which are far more complicated/richer than merely categorizing sensor signals or weight values in some highly oversimplified model of human neural networks.

There is presently some work being done with a world-renowned physicist on attempting to translate brain scans into language, but the research has not been completed yet. To me this is a kind of confabulation also.

One of the kinds of confabulation I am working on these days is translating the contents of MathML XML instances into confabulated natural language depicting not only the mathematical terminology of the items there but also the mathematical conceptual meaning of the MathML instance, such as “Gaussian Quadrature math”.

If you have further questions, ask.
cheers

Marina Santini • Thanks, David. I understand better now. It sounds fascinating…

Cheers, Marina
