The WebGenre Blog: The power of genre applied to digital information. By Marina Santini » Entries tagged with "noise"

Lecture 2: Understanding and Preprocessing Data (ML4LT 2015)

Lecture 2: Understanding and Preprocessing Data. Basic Statistics & Graphs. Lecture 2: Preliminaries (Understanding and Preprocessing Data), from Marina Santini … Read entire article »

Filed under: featured, lectures

Lecture 1: What is Machine Learning (ML4LT 2015)

Opening lecture of the Machine Learning for Language Technology course at Uppsala University, Sweden, Autumn 2015. Lecture 1: What is Machine Learning?, from Marina Santini … Read entire article »

Filed under: featured, lectures

Lecture 2: Basic Concepts in Machine Learning for Language Technology

Machine Learning for Language Technology 2014 – Course Schedule … Read entire article »

Filed under: featured, lectures

White Paper: Automatic Genre Identification – Testing with Noise

Automatic Genre Identification – Testing with Noise, by Efstathios Stamatatos, Serge Sharoff, Marina Santini. Copyright © 2012, All rights reserved. Citation: Stamatatos E., Sharoff S., Santini M. (2012). Automatic Genre Identification – Testing with Noise. [White paper]. Retrieved from http://www.forum.santini.se/2012/03/white-paper-automatic-genre-identification-testing-with-noise/ The genre collections used in the experiments are available here. The reference list is here. In the experiments described below, genre classes from three genre collections have been used: Santinis7 (Santini, 2007), KI-04 (Meyer zu Eissen and Stein, 2004), and HGC (Stubbe and Ringlstetter, 2007). These genre collections were created by different people, at different universities, for different purposes, with different criteria and different notions of what genre is. Since genre is a complex concept and genre classes can be characterized in different ways, we assume that having an AGI algorithm … Read entire article »

Filed under: collaborative blogging, computational models, featured, signed posts, white papers

AGI: Structured and Unstructured Noise

How would you handle automatic text classification in noisy conditions? This is what has been done, to my knowledge, in Automatic Web Genre Identification (AGI). By noise I refer here to two different disturbing factors*: 1) the training sample and the test sample come from different sources/annotators; 2) the test set contains genre classes that are not present in the training set. These two types of noise reflect the following real-world conditions when working with genre: 1) since genre is a complex notion that has been interpreted in different ways, the identification of the same genre class can vary depending on the research agenda or individual preferences; 2) we cannot possibly conceive a genre classifier that has a good performance if we include all existing genres either on the web or in … Read entire article »
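The two noise conditions listed in the excerpt above can be made concrete with a small sketch. It is only an illustration, not the procedure used in the post: the `Doc` record, the `split_noise` helper, the collection names "A" and "B", and the genre labels are all hypothetical. The sketch partitions a test set into documents that match the training conditions, documents of a known genre labelled in a different collection (condition 1), and documents whose genre class never appears in training (condition 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for one labelled web page."""
    text: str
    genre: str   # genre label assigned by the collection's annotators
    source: str  # name of the collection the page comes from


def split_noise(train, test):
    """Partition test documents by the two noise conditions.

    Returns (clean, cross_source, unseen_class).
    """
    train_genres = {d.genre for d in train}
    train_sources = {d.source for d in train}

    # Condition 2: the genre class is absent from the training set.
    unseen_class = [d for d in test if d.genre not in train_genres]

    # Condition 1: known genre, but labelled in a different collection.
    cross_source = [
        d for d in test
        if d.genre in train_genres and d.source not in train_sources
    ]

    # Neither condition applies: same classes, same collection.
    clean = [
        d for d in test
        if d.genre in train_genres and d.source in train_sources
    ]
    return clean, cross_source, unseen_class


# Toy data with made-up collections "A" and "B".
train = [Doc("post text", "blog", "A"), Doc("q and a", "faq", "A")]
test = [
    Doc("another post", "blog", "A"),
    Doc("more q and a", "faq", "B"),
    Doc("buy now", "eshop", "B"),
    Doc("add to cart", "eshop", "A"),
]

clean, cross_source, unseen_class = split_noise(train, test)
print(len(clean), len(cross_source), len(unseen_class))
```

Reporting a classifier's accuracy separately on each of the three partitions, rather than on the pooled test set, would show how much each kind of noise costs.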

Filed under: dialectic, discussions, overviews

Excerpt: Cross-Testing a Genre Classification Model for the Web

Cross-Testing a Genre Classification Model for the Web, by Marina Santini. In: Genres on the Web: Computational Models and Empirical Studies. Alexander Mehler, Serge Sharoff and Marina Santini. Text, Speech and Language Technology, Volume 42, 2011. DOI: 10.1007/978-90-481-9178-9. Abstract: The main aim of the experiments described in this chapter is to explore how to assess the robustness of genre models for the web. For this purpose, a simple genre model is presented and cross-tested with four genre collections. In this difficult experimental setting, the model shows some stability and its results are in line with other current genre-enabled applications. The model provides some insights into open issues in AGI on the web. In particular, it shows that we know very little about the effect of noise on genre classification results. The set of experiments presented here offers … Read entire article »

Filed under: chapter excerpts