Towards a Computational Theory of Digital Genre (I): Working Definition of Genres for Computational Purposes
by Marina Santini – Last Updated: 29 Oct 2012
1. What is a (textual) genre?
• A genre is a class of texts with similar communicative, textual and linguistic features.
2. What characterizes a genre?
• Must have a name
• Must be recognized within a community
• Must be produced or retrieved during a task
• Must have conventions
• Must raise expectations
• Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.)
3. What characterizes a digital genre?
• The same characteristics listed above.
• A digital genre is any kind of genre that has a digital form, such as emails, chats, online academic papers, online newspaper articles, blogs…
• A digital genre can be any paper genre converted into a digital form OR a class of texts that has no counterpart in the paper world, such as home pages, About Us web pages, FAQs, webzine articles, personal blogs, corporate weblogs…
4. Genre Characterization: Ex: Recent but fully acknowledged genres
• Name: a genre name must indicate a class, a family (for genre name formation, see Görlach, 2004). Recent digital genres: blogs, tweets, chatlogs, etc.
• Community: a genre is not something individual. A genre is a textual form that is used and recognized by a community (cf. personal style). Ex: blogs → bloggers and blog readers; academic home pages → academics; etc.
• Task: a genre meets a RECURRENT communication need. Ex: the personal home page genre tells us something about a person; a technical blog genre is informative about some specific technology; etc.
• Conventions: ex: a personal blog is made of posts organized in reverse chronological order, in which a blogger communicates personal and subjective views on some facts.
• Expectations: when reading a personal blog, readers expect to read something personal (personal facts or personal opinions) and expect the technical possibility to leave a comment, if they wish to do so.
• A genre is a cultural artifact: a genre might evolve over time (see: Weblogs: a history and perspective by Rebecca Blood, 2000) or might disappear if society changes (ex: chansons de geste). New genres emerge with new media, new technologies, new information needs.
5. Genre Characterization – Ex: A novel and fully emerged genre, the query log genre
• Name: in line with other digital genres (ex: web log → blog)
• Community: internet users, IR practitioners
• Task: information needs specified in a search engine
• Conventions: short texts written in ”keywordese”
• Expectations: to find relevant information
• Cultural artifact: a product of our media-based, internet-based society OR a subproduct of search engines
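For computational purposes, the six characteristics used above can be stored as a simple record. The following is only a minimal sketch: the class name GenreProfile, its field names and the helper missing_characteristics are hypothetical choices made for illustration, not part of any existing library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenreProfile:
    """One genre described along the six characteristics listed in the post.

    All names here are illustrative assumptions, not an established schema.
    """
    name: str
    community: List[str]
    task: str
    conventions: List[str] = field(default_factory=list)
    expectations: List[str] = field(default_factory=list)
    cultural_context: str = ""

    def missing_characteristics(self):
        """Return the names of the characteristics that are still empty."""
        return [key for key, value in vars(self).items() if not value]


# The query log genre, filled in with the values given in the list above.
query_log = GenreProfile(
    name="query log",
    community=["internet users", "IR practitioners"],
    task="information needs specified in a search engine",
    conventions=["short texts written in keywordese"],
    expectations=["to find relevant information"],
    cultural_context="product of an internet-based society, or subproduct of search engines",
)

print(query_log.missing_characteristics())
```

A profile with empty fields reports which characteristics still need to be documented, which may be handy when a candidate genre is only partly described.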
6. The query log genre: Linguistic and Textual Conventions
• Length: short text (a query log can be seen as a corpus of very short texts, shorter than tweets, mobile text messages, chat logs, etc.)
• Sublanguage/Jargon: ”keywordese”
• Register: neutral
• Morphology: LITTLE
• Syntax: OCCASIONALLY (usually no articles, no prepositions, no subclauses, etc.)
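Two of these conventions, length and the scarcity of articles and prepositions, can be approximated with very simple counts. The sketch below is a rough heuristic under stated assumptions: the function-word list is a tiny hand-picked English sample, and the names tokenize and keywordese_profile are hypothetical.

```python
import re

# Assumption: a tiny illustrative list of English function words;
# a real study would use a fuller, language-specific list.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "with",
    "by", "from", "and", "or", "that", "which", "is", "are",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keywordese_profile(text):
    """Return the token count and the share of function words in a text."""
    tokens = tokenize(text)
    n_function = sum(1 for token in tokens if token in FUNCTION_WORDS)
    ratio = n_function / len(tokens) if tokens else 0.0
    return {"length": len(tokens), "function_word_ratio": ratio}


# Invented examples: a query in keywordese and a full-sentence paraphrase.
query = "cheap flights stockholm london"
sentence = "I am looking for a cheap flight from Stockholm to London."

print(keywordese_profile(query))
print(keywordese_profile(sentence))
```

On this invented pair, the query is shorter and contains no function words from the list, while the full sentence contains several. Real query logs would of course need a proper word list and tokenizer.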
7. Query Log Genre: The Benefits
• Expressed in a ”lean” sublanguage, the keywordese:
• reduced morphology
• reduced syntax
• short texts
• Mostly Nouns and Verbs
• Reduced size: compare a 2-year collection of emails vs a 2-year collection of query logs
• = REDUCED SIZE, REDUCED PRE-PROCESSING; NO DATA CLEANING!
8. Query Log Genre: Expectations
• short texts written by users to find relevant information through a search engine.
• The texts (queries) must express information needs a.k.a. users’ intents.
• It is good practice to be cautious with the interpretation of users’ intents. However, if we mine query logs with a simple quantitative approach, it is possible to extract recurrent information needs and build upon them.
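One possible reading of a "simple quantitative approach" is plain frequency counting over normalized queries. The sketch below assumes that reading; the function names, the min_count threshold and the sample log are all invented for illustration.

```python
from collections import Counter


def normalize(query):
    """Lowercase a query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def recurrent_queries(queries, min_count=2):
    """Return (query, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(normalize(q) for q in queries if q.strip())
    return [(q, c) for q, c in counts.most_common() if c >= min_count]


# Invented sample log, for illustration only.
log = [
    "opening hours library",
    "Opening  hours library",
    "reset password",
    "parking permit",
    "reset password",
    "opening hours library",
]

print(recurrent_queries(log))
```

Queries that recur above the threshold are candidates for recurrent information needs. The counts say nothing about why users typed them, so the caution about interpreting intents still applies.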
9. Why is a classification by genre beneficial for a computational approach?
• The main benefit is the contextualization of information! A genre is a CONTEXT carrier because it is based on recurrent conventions and predictable expectations. A genre provides the communicative context and the communicative purpose for which a text has been produced. The concept of genre is both a semantic and a pragmatic concept (i.e. it includes the semantic meaning + the situational/communicative context).
• Complexity reduction & e-Learning: a text receives an identity through belonging to a certain genre, and this identity reduces the cognitive effort.
• Information understanding & Forensic Linguistics: genre competence increases self-protection against digital crimes (such as phishing, hoaxes, cyberbullying and threats) because it can help spot genre anomalies and consequently malicious intentions.
• Findability & Information Retrieval: the membership of a document in a genre tells us something about the communicative context in which the document has been produced. From the communicative context, we can infer or assess the relevance of this document to our information needs.
• Predictivity and Automatic Summarization + Text Summarization: because a genre is based on recurrent conventions and predictable expectations, it is possible to identify where the most important and relevant information is located within a document.
10. Genre is ubiquitous
• Language does not exist in the abstract.
• Language use changes with the situation, purpose, audience, emotional state, etc.
• We might express the same meaning with different words according to different communicative contexts, using different genres according to the task, the audience, the purpose, etc.
I would appreciate your comments, thoughts and objections on this view of genre for computational purposes. Thanks in advance, Marina
***End of the post***
To be continued in the post: Towards a Computational Theory of Digital Genre (II): The Fuzzy boundaries of genre classes
Changes Log: 29 Oct 2012