Post signed by: Maya Dimitrova, Institute of Control and System Research, Bulgarian Academy of Sciences
* In this post, all references, figures and tables have been removed by the blog’s moderator.
3 Gestalt in Information Retrieval
A group of information retrieval studies is concerned with identifying new linguistic, lexical or formal features (like the special tags) that can be captured by automatically processing html scripts – scanning, tokenizing, clustering – and extracting meaningful information to identify the style or genre of the text inside the Web page. Web genre in this group of studies is defined as a multi-dimensional structure of features of text and html design, pointing to various linguistic and cognitive aspects of the retrieved Web document. It helps the user not just to find the relevant topic, but also to get a clue about the style the document is written in and its level of superficiality, detail, technicality, readability, explanation and illustrativity. In order to make the process of automatic classification along various dimensions more efficient and sufficiently fast, Web designers employ a great deal of heuristically based knowledge. In many cases these heuristics are grounded in research areas with long traditions, such as rule-based reasoning, neural networks, linguistics and cognitive science. Our aim has been to apply heuristics based on cognitive science, justifiable on the grounds of experimental research and knowledge about human cognition.
3.1 Emergent Concept(s) in Web Information Retrieval
We focus on current research most closely relating Web genre to stylistic information retrieval, which can be illustrated by a Venn diagram. The Web genre area of the diagram represents the group of most recent studies, which seek clusters of formal features of Web sites to define Web genres. The main concern within this group of studies is the better utilization of knowledge about the classified genres and their visualization from a systematic perspective.
The second area of studies comes from information retrieval research on the automatic discovery and recognition of writers’ style. It is our observation that this group of studies has had significant influence on information retrieval in the last twelve years by researching the issue of style and its formalization and stressing its importance – for the writer, for the reader, for the crawler and the wrapper inside the Web – as a bridge between semantics and formalization; structure and access; usefulness and aesthetics in user–Web interaction.
Our attention in discussing current Web genre research will be on the small overlapping portion of the diagram – where researchers feel it important to identify features belonging to both areas of Web genre and text style research. For this aim we take the style definition of J. Karlgren: “Style is, on a surface level, very obviously detectable as the choice between items in a vocabulary, between types of syntactical constructions, between the various ways a text can be woven from the material it is made of. It is the information carried in a text when compared to other texts, or in a sense compared to language as a whole. This information if seen or detected by the reader will impart to the reader a predisposition to understand the meaning of text in certain ways.” In considering Web genre we will try to show that it is an emergent concept. Style is not Web dependent – it is the base of genre. Ingvar Johansson has defined base and supervenient levels of description of an “emergent whole”. The lexical, linguistic and grammatical aspects of style are the “concrete collections”, some of which may and some may not be relevant to the emergent Web genre concept. In other words – the text style is the necessary background for the genre figure.
3.2 Gestalt Principles and Processes in Web Genre Studies
Sometimes authors do not essentially discriminate between the genre and the style of the text inside the html page. This is done for the purposes of the algorithmization of the information retrieval task of genre classification and analysis. The authors give a quite operational definition of Web genre: “Genre is an abstraction based on a natural grouping of documents written in a similar style and is orthogonal to topic. It refers to the style of text used in the document”. Two genre dimensions are reliably identified in this study on the basis of textual and linguistic features: whether the contents express fact or fiction (the subjectivity dimension) and whether they involve a positive or negative review of a given topic (the opinion dimension). A comparison of different classification approaches showed that part-of-speech (POS) tagging outperforms the bag-of-words (BOW) approach for the subjectivity dimension, and that it also shows good domain transfer, for example from football to finance and vice versa. The review classifier, however, performed less satisfactorily, with POS tagging even inferior to the BOW approach; the BOW approach performed sufficiently well across all of the domains.
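The contrast between the two feature spaces can be sketched in a few lines of Python. This is a minimal illustration, not the cited study’s actual pipeline: the tag set, the pre-tagged toy sentence and the choice of tag bigrams are our own assumptions.

```python
from collections import Counter

def bow_features(tokens):
    """Bag-of-words: unordered token counts, which inevitably carry topic."""
    return Counter(t.lower() for t in tokens)

def pos_features(tagged):
    """Part-of-speech tag bigrams: sensitive to style, largely topic-free."""
    tags = [tag for _, tag in tagged]
    return Counter(zip(tags, tags[1:]))

# Hypothetical pre-tagged sentence (tags are illustrative only).
tagged = [("The", "DET"), ("striker", "NOUN"), ("allegedly", "ADV"),
          ("missed", "VERB"), ("the", "DET"), ("penalty", "NOUN")]
tokens = [w for w, _ in tagged]

print(bow_features(tokens)["the"])            # 2
print(pos_features(tagged)[("DET", "NOUN")])  # 2
```

A BOW classifier trained on such counts ties its decisions to domain vocabulary (“striker”, “penalty”), while a POS-bigram classifier sees only grammatical patterns, which is one intuition behind its better domain transfer for the subjectivity dimension.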
While assuming orthogonality of these dimensions, the authors actually have the exclusivity assumption in mind. The document classifier is supposed to identify either the genre or the topic as such – based on various textual and linguistic elements: word and sentence length, grammatical elements, specific words and tags. The post hoc analysis of the word stems at the root of the C4.5 classifier in this study revealed some interesting observations about the actual words used in the automatic processing of the text, where some overlap or bias was observed, thus violating the orthogonality assumption. For example, frequent use of the word “romantic” corresponded closely with positive reviews of restaurants, whereas it was quite negatively associated with the film review texts.
One of the problems with building text classifiers, therefore, is that the actual meaning of a word cannot be completely isolated from the process of automatically scanning texts to collect computational features which define aspects of style and dimensions of genre. From a cognitive point of view, however, this is hardly unexpected, and a relation of ‘quasi-independence’ between processes that ideally should be independent in stochastic terms is more than frequent; see, for example, the works of Endel Tulving. The orthogonality assumption in the context of the presented study is an abstraction – an ideal to be pursued in research and modeling. Approximation is the natural way people understand things, let alone in the Web context. From a Gestalt perspective, what is important is not to skip the supervenient level of analysis of the phenomenon in focus, as Ingvar Johansson has shown. Meaning influences automatic Web document classifiers as an artifact or a fault; meaning influences human understanding as the bounding feature of human scanning or hearing of texts.
The unavoidable supervenient level of analysis of the “emergent whole” in content-based image retrieval systems is explained by Ingvar Johansson. According to him, whenever a complex structure that is emergent on some discrete base is described (like a picture on the screen or a canvas on the basis of discrete pixels or color dots), the following has to be present in the process of analysis: “With respect to emergent wholes there is no plain inference rule, but there is a supervenience rule; with respect to concrete collections there is a plain inference rule and, therefore, a supervenience rule, too.” Johansson illustrates the idea of the emergent whole with the example of the popular smiley :-). When the elements – the punctuation marks – are placed in arbitrary order, there is a plain inference rule that preserves their individual format and their spatial relations. This rule may hold for their spatial configuration and can define a configuration independent from the one which produces the idea of a face in the viewer’s perception. The face configuration depends on the individual configurations and formats of the elements, but emerges on a different – supervenient – level, obeying a different rule on this next level – a rule describing a face. What is important is to bound the two levels inferentially and to avoid skipping over the deeper level – the meaning – in any possible interpretation of the surface features. The idea of a face is thus bounded by the format of the individual elements and no longer that much by their spatial relations, since the supervenient rule requires this. As formulated by Barry Smith: “That is, just as we conceive the complex whole in [A] as possessing a certain characteristic whole-property, so we can conceive the different parts of this complex as possessing their own characteristic part-properties in virtue of which they come to make up that total whole which is the original Gestalt”.
To illustrate the preserved part-properties in the smiley Gestalt, just imagine a smiley on a graffiti wall, or even a smiley within a graffiti painting (figure-ground). Color, paint and texture have their own complex part-properties and nevertheless belong to the smiley as a face.
As Ingvar Johansson points out, skipping the deeper level in the analysis is in fact a reduction of the phenomenon (image composition from color dots) – not a simplification of the computation and the rule inference process. Quite the contrary – it will lead to “unnecessarily complex” machine inference rules with no guarantee of validity. At the same time, it is possible to formulate logical, non-contradictory and simple rules at both levels of existence of the grounded phenomenon (the emergent whole), which can be computationally and cognitively justified.
From a Gestalt perspective it is necessary to mention another aspect of the smiley example of an emergent whole – the simplicity account of Max Wertheimer (1923). Nowadays a smiley is a two-element entity – happy 🙂 and sad 🙁 face – with a maximally shrunken base and richness of meaning at the supervenient level – expressing simultaneously a face and a palette of emotions – and it is a Gestalt. The smiley Gestalt is the result of a dynamic (evolved in time) cognitive process: it is immediately given in cognition, memorable, emotionally rich and socially relevant, and reflects the special kind of Gestalt complexity defined by Edwin Rausch (1966). The smiley emerged from the electronic communication process of sending emails to friends, not so much from communicating with authorities and institutions. We think that this emergence of new signifiers of human thoughts and emotions via the Web is what the Web is and should be about in its communication with the user.
We will illustrate the bounding of levels in the lexical representation for automatic classification of texts from the Web with some studies very close to the Web genre study presented above. In our studies we have always referred to the multifaceted user preferences for Web sites, which is quite different from the multifaceted representation of Web document genres used for the graphical depiction of clustered sites. The study discussed next was performed in parallel with the previous one, sharing the main idea of identifying various and, more importantly, multiple dimensions of genre in Web pages. One distinctive aspect of our studies is that the cognitive independence of the dimensions was tested with several user studies. Another is that a heuristic derived from experimental studies testing cognitive phenomena was employed to optimize the process of automatic scanning, tokenization and classification of Web pages.
The “orthogonal” dimensions represent document features for which we assume there is cognitive and linguistic potential to be implemented in automatic Web document classifiers.
Depicting user preferences for the depth of knowledge and the extent of explanation in sites about a certain topic like “Pearson correlation”, “Neural networks”, “Java servlets” or “Thatcher illusion” can be done graphically along dimensions drawn as crossing straight lines. In order to suggest to the viewer that we also care about the on-topic vs. off-topic precision of the results returned by our user interface, we shift the angle between the straight lines and draw a six-ray star, conventionally assuming this is the simplest way to suggest a 3D visualization of our three intersecting axes. At the same time we are confident in claiming that we are representing “orthogonal” dimensions in the 3D visual space.
The depiction is simple, incomplete and even inadequate (in the sense of Gestalt inadequacy) with respect to any 3D visualization principle of a simple Cartesian coordinate system. At the same time the complexity is there, behind or inside the drawing – the idea that in a 3D view we would be able to see just one facet of it; that we can easily interpolate the location where our – let us say – quite expert but fairly brief text would be placed inside this imaginary cube, as well as at what relative distance it would be from the place of some, for example, rather long but fairly popular text on the same topic. In our case the inadequacy is the central (Gestalt) quality of the representation, optimizing a clearer understanding of the meaning and avoiding cognitive and perceptual conflict rather than introducing it. What cognitively takes place while viewing such a depiction of the knowledge preferences of the user is a process of internal cognitive folding of the six-ray star into a 3D image of a cube. What is more, the six arrows force us to mentally rotate it all the time in order to internally perceive all six facets of our imaginary cube as well as to spatially explore the inside volume of the cube. Support for this spatiality of mental imagery, rather than for a reflection of objective visual properties like color and brightness of the visual field, can be found in the experimental works of Alan Baddeley on the nature of the visuo-spatial sketchpad in human working memory. This kind of mental folding of a 2D imaginary spreadsheet into a 3D cube and unfolding of the cube back into the spreadsheet is an involuntary cognitive process, which introspectively is felt very well by anyone trying to understand the meaning of the drawing. The internal dynamics can be explained very well in Kurt Lewin’s and Bluma Zeigarnik’s terms. The six-ray star influences our imagination the same way the Necker cube influences our visual perception.
Introspectively, we see constant movement of the geometric elements, which cannot be stopped unless a complete shift of attention is made. To summarize – the Necker cube makes us imagine, the drawing of a star with labeled rays makes us think, and the smiley makes us feel – and they are all Gestalts in the holistic way this entails. In our view, similar conventional representations of holistic entities come to life because they capture essential Gestalt qualities; they are abundant on the Web, and in a sense the Web is the right cognitive medium for their emergence and existence.
3.3 Emergent Properties of Web Sites
We would like to extract, from the variety of possible descriptors of Web sites, some internal aspects of site genre like factuality, subjectivity, expertness and popularity. For this aim we need to draw an essential distinction between our studies and studies that measure Web popularity and the influence of popularity ranking in search engines, and to say explicitly that we are not dealing with the latter.
Measuring popularity inside the Web pages
Our initial and main aim in designing a Web page classifier to assess genre dimensions was to see if the expert level of description of the topic in a page returned in response to a user query can somehow be captured, heuristically or otherwise, in an algorithmic way. In a study explicitly addressing the issue of determining the expert level of a Web site, the aim was to identify criteria describing whether a document is a scientific text or a popular science text about diabetes. Two dimensions of Web site analysis are outlined – external and internal. The external dimension reflected the physical nature of the Web site – location on the Web (URL), format, architecture and origin of the site. The internal dimension, characterizing the actual Web page as seen by the user, reflected three groups of page features – graphical, structural and semantico-discursive. The authors analyzed their corpus of documents along these dimensions and employed a threshold coefficient to assess the level of popularity vs. expertness of these documents. They report a systematic list of features of the analyzed pages as a first step towards identifying a set of criteria for distinguishing between scientific and popular science Web sites about diabetes. Their analysis of how popular Web sites differ from expert texts on the Web revealed that, for expert-level texts, the viewer can rely on features like domain-specific terminology, compactness of text, presence of abstracts and references, and author and institution names and addresses. For the popular level of description of texts about diabetes, however, it was impossible in this study to outline a similar set of uniform features of the medical sites intended for the more general audience: “The main text has variable size and soft colors. It can be well stretched out as well as compact; short, as well as long; It is often broken into paragraphs and it often contains lists of items and sometimes, links”.
The authors propose a threshold score as a weighted measure of the effect of these criteria according to their discriminative value. Describing the current state of their own research, they write: “This arbitrariness of threshold points out that there is a continuum between scientific and popular science documents”. The interesting point in this study is the mention of the continuum nature of the expert level in a set of documents all reflecting varieties of descriptions of scientific results, in particular relating to helping the user acquire information on important issues like health and education. The problem of information retrieval for knowledge discovery on the Web is also formulated as a research issue in the “iVia Open Source Internet Portal and Virtual Library System”, where the idea is presented that selecting sites on the basis of the depth of knowledge they contain is related to automatic genre identification studies.
In view of the over-generalized operational definitions of Web genre by researchers in information retrieval, the following question may arise: Is the expert level of the text a genre dimension? We have found most satisfying – and most relevant for an affirmative answer – the following formulation of the function of a genre by J. Karlgren: “… the author adheres to the conventions of a genre or diverges from it for a purpose, consciously or not, with varying degrees of success. The reason to chart stylistic characteristics of a text in an information retrieval context is to predict the usefulness of a text for a reader expressed as values along salient dimensions of textual variation or membership of the text in a category. I believe reader perceptions of text functions are central to this task: what readers believe about texts is what underlies categorization schemes and authorship alike”. This definition emphasizes an important feature of text style – the purpose of the author when selecting which words to use in writing – which is similar to a principle in education about the purpose of the learning material, its content and the form it is presented in. An important and specific feature of Web genre emerges, distinguishing it from genre features in music, literature, entertainment and the like: Web genre seems to be more purposeful and more inviting of user attention, and therefore of greater cognitive initiative.
To summarize our review of current Web genre research, we feel quite confident in saying that Web genre is a Gestalt – a complex, emergent, dynamic, specific, transposable concept, not yet fully understood (inadequate) or even described (incomplete), unfolding in the researcher’s thought as an emerging and emergent cognitive entity of holistic nature and social value. The supervenient level of analysis is present in the exploration of its base – the lexical, linguistic, textual, graphical and orthographical elements – and in the concern about the meaning of genre to users and in their lives (bounding). In order to understand the objective grounds of the proposed process of unfolding (later we will propose it as a parallel to the decomposition process), we ask the reader to recall the two-element smiley example :) and its original form :-), necessary to explain the process of bounding the base-level and the supervenient-level rules. This invites reading the original paper of Ingvar Johansson, where the base-level description is given by the succession of [ – ) : ] and was initially necessary to introduce the pattern notion. Our concern, in line with Johansson’s, is about being non-reductionist while trying to be objective about building artificial cognitive systems.
As a transition point to the next section, where we describe our cognitive science based approach as the (hopefully) supervenient level of investigation of the results from the base-level rules for the physical scan of html scripts, and the subsequent tests of the usefulness of our Web document classifier, we refer again to the study of Karlgren on how users define a “good” exemplar of a genre class: “Most likely, consistency and formality are less useful quality criteria for the categorization than is apparent clarity.” The apparent clarity of the genre example or “exemplar” (to cite Eleanor Rosch on prototype boundaries as family resemblances) entails immediate givenness of the cognitive entity called genre, rather than the more effortful decision of consistency or formality (simplicity principle). We will add that it also entails a certain spatiality of the representation of the idea of the good Web genre on the part of the user.
3.4 Cognitive Science and Web Page Classification
In our studies we have proposed and investigated a cognitive science based heuristic, in addition to other frequently used text measures like word count and sentence length, in order to reduce computation and optimize the performance of a Web page classifier. Our classifier aims to discriminate among documents along two dimensions – the amount of detail and the expert level of the text of a retrieved Web page. The classifier is implemented in a Java servlet, built as a wrapper around a search engine. The servlet sends the query to the search engine, extracts the first 20 URLs, opens them one by one, computes our estimates, closes them and displays them graphically via an applet in the user’s browser window. The user opens the respective Web document by clicking on the yellow boxes.
The classifier is based on two simple formulae on the ratio of the presence of high and medium frequency long English words, indices of technical elements and HTML tags, and the ratio of words-to-tokens. The retrieved results are displayed along two dimensions – the expert level of the text and the amount of detail or explanation in it. The detail dimension along the x axis is computed as the sum of an index of the document length, the ratio of long words to HTML tokens and an index of the presence of images. The expert dimension along the y axis is computed as the sum of an index of the ratio of the presence of long words of high and medium natural language frequency and an index of technical HTML elements (math tags and gif files).
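The two scores can be sketched roughly as follows. This is an illustrative sketch in Python (the actual classifier is a Java servlet): the normalizations – the length cap and the binary presence indices – are our assumptions, since the exact index definitions are implementation details not spelled out here.

```python
def detail_score(n_tokens, n_long_words, n_images, len_cap=5000):
    """Detail (x axis): document-length index + long-word/token ratio
    + image-presence index.  len_cap and the 0/1 image index are
    assumed normalizations, not the classifier's actual constants."""
    length_idx = min(n_tokens / len_cap, 1.0)
    long_ratio = n_long_words / max(n_tokens, 1)
    image_idx = 1.0 if n_images > 0 else 0.0
    return length_idx + long_ratio + image_idx

def expert_score(n_hifreq_long, n_tokens, n_math_tags, n_gifs):
    """Expert (y axis): ratio of high/medium-frequency long words
    + technical-element index (math tags and gif files)."""
    hifreq_ratio = n_hifreq_long / max(n_tokens, 1)
    tech_idx = 1.0 if (n_math_tags + n_gifs) > 0 else 0.0
    return hifreq_ratio + tech_idx
```

A retrieved page would then be plotted by the applet at the point (detail_score, expert_score), so a long, image-rich page with many technical elements lands in the upper-right region of the display.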
The example dimension derived from cognitive science studies is the expert or popular description of the topic of the text. The motivation to look deeper into the results of cognitive science research came from some mismatch, in our view, in heuristically based implementations of cognitive and lexical features in automatic text classifiers. We have tried to be more careful about distinguishing ad hoc from empirically justified heuristics in this application.
It is generally assumed that long words are associated with scientific and expert texts. However, there is no reliable scientific evidence of this to date. An interesting result, for example, is the statistical confirmation that long words fit better into longer sentences; the two are even correlated. A nice heuristic would be, perhaps, to rely on the number of long words in a text to account for the relative presence of long sentences, rather than employ a punctuation scan in cases when the exact number of sentences is not required for the page classification task. As previously mentioned, long words and long sentences are not discriminative for scientific versus popular science Web sites.
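Such a punctuation-free proxy could look like the sketch below; the minimum word length is an assumed parameter for illustration, not an empirically fixed value.

```python
def long_word_index(tokens, min_len=9):
    """Share of long alphabetic words among all tokens, as a rough
    proxy for the presence of long sentences, avoiding a punctuation
    scan (min_len is an assumption for illustration)."""
    long_words = sum(1 for t in tokens if t.isalpha() and len(t) >= min_len)
    return long_words / max(len(tokens), 1)
```

The index needs only the token stream the html tokenizer already produces, so no extra pass over the raw text is required.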
On the other hand, it is a cognitive science fact that words of high natural language frequency are processed differently from words of low natural language frequency, and there is plenty of evidence for this in experimental cognitive and linguistic research. It is a widely known phenomenon in experimental research on remembering lexical items and their subsequent retrieval. Words that are frequently used in natural language are easily remembered for subsequent recall tests. These words, however, are difficult to identify if the retrieval task is to recognize the studied word among other, non-studied words. The reverse is also true: low-frequency words are easily recognized as studied before, but difficult to recall freely. This phenomenon is called the “word frequency effect”.
In building the Web page classifier, our aim was to see if relying on cognitive science can help us formulate a supervenient rule, on top of the lexical, linguistic and orthographical base of the output of the html script tokenizer, to simplify the processing and reduce computational time and cost. Simply adding yet another rule to the if–then algorithm, however, by no means simplifies it. We tried to look from another angle and reduce the actual lexical database instead. We decided to fetch just a part of the corpus of natural language word norms given in the so-called “Brown Corpus” and chose to look at the high- and medium-frequency long English words. We designed a combined heuristic using an index of the occurrence of high- and medium-frequency long English words in Web pages (i.e. in the tokenized html scripts).
One clarification of this idea has to be made. Of all the words in a language, the majority are low-frequency words. By definition, low-frequency words are those with a natural language frequency (NLF) norm below 10 usages per million words. It is intuitive that experts tend to use special terms (rare words) more often than usual, and sometimes we feel that experts use long sentences and therefore utter longer words. So we asked ourselves: what about the high-frequency long words? Who uses them most often and in what context? We were not able to find studies explicitly addressing the usage of long words of high natural language frequency and decided to test this in our study. We needed a threshold, presumably a flexible one, so we collected all the words longer than 8 characters with NLF norms higher than 20 per million. The whole corpus consists of 490 words altogether for the English language as given in the Brown Corpus. During the automatic scan of the page, our script tokenizer (operating on an independent base – punctuation, space and tags) compared each long token with the words in our corpus and counted the occurrences of just the high-frequency long words. We observed that adjusting the threshold to around 60 per million provides fairly good discrimination of the popularity of description of a scientific topic. The words in the left column more often appeared in popular descriptions of issues related to science – theories, experimental results, terminology explanation, effects on human life and knowledge, and the like. Our cognitive science based heuristic has helped define a textual feature of the Web page to assist users in finding the appropriate expert level of description of the document behind the suggested link, bringing them closer to the feeling of satisfaction in the bookshop when the preferred style of writing inside a book is found.
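A minimal sketch of the counting step might look as follows. The miniature word list is a hypothetical stand-in for the actual 490-word corpus, which has to be extracted from the Brown Corpus norms (words longer than 8 characters with NLF above 20 per million).

```python
# Hypothetical miniature stand-in for the 490-word list of long
# (>8 characters) high-frequency English words; the real list must
# be built from the Brown Corpus frequency norms.
HIFREQ_LONG = {"important", "different", "experience", "government",
               "themselves", "understand", "education", "questions"}

def hifreq_long_per_million(tokens, wordlist=HIFREQ_LONG):
    """Occurrences of high-frequency long words, scaled per million
    tokens, for comparison against the ~60-per-million threshold we
    found to discriminate popularity of description fairly well."""
    hits = sum(1 for t in tokens if t.lower() in wordlist)
    return hits * 1_000_000 / max(len(tokens), 1)

def looks_popular(tokens, threshold=60.0):
    """True if the page leans toward the popular end of the expert
    dimension under the assumed threshold."""
    return hifreq_long_per_million(tokens) >= threshold
```

The membership test against a fixed small word set keeps the per-token cost constant, which is the sense in which the reduced lexical database simplifies the scan.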
This is just the initial result of our attempt to consider the meaning of a word not just as an artifact of an automatic information retrieval system, but as a building element of a supervenient rule for knowledge discovery and meaning extraction from Web sites.
In popular texts explaining scientific results we find phrases like “beautiful theory”, “interesting result”, “possible explanation”, whereas in deeper science writings we more often see phrases like “marginal significance”, “personality traits”, “research technique”, “laboratory equipment”. Our next step will be to test this more substantially in medical science and Web education. One reason not to take it for granted, but to set it as a task for further tests, is the need to outline the Gestalten – although the provided examples seem holistic enough. It will be interesting to see, on stylistic and genre grounds, the distribution of the random popular terms – our high-frequency long words – compared to the distribution of holistic entities like “beautiful theory” or “marginal significance”. This is important for inferring the depth of knowledge in the site in general – or as a domain dependence issue. Another aspect is the flexibility of the threshold. In the second column we give examples of words with medium frequency norms from a wide range, which needs more precision. Investigating this specific range of word norms may help define principles or rules for annotating scientific results at the appropriate level of knowledge by the researchers themselves, for easier and better input to professional databases. Another necessary mechanism is to apply a readability filter in terms of the circularity of the given definitions, too. By investigating this continuum of possibilities to represent depth of knowledge, we will attempt to bound our base (the scientific texts) to various levels of supervenient rules for lexical expressions, to find the appropriate level for different communities of users in accordance with their knowledge needs. This approach does not exclude formal and mathematical algorithmization.
4 User Account of the Emergent Properties of Web Sites
4.1 Holistic understanding of aspects of Web genre [see Part III, forthcoming]