PhD thesis reviewed by Marina Santini
Fredrik Olsson, Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora. Doctoral thesis, University of Gothenburg, 2008
Download thesis from this page: http://soda.swedish-ict.se/3518/
The PhD thesis “Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora” by Fredrik Olsson contains 13 chapters and an appendix with the base learner parameter settings. The Introduction sets out the problem and the argument, and the remaining 12 chapters describe the background (Part I, Chapters 2-5), present the BootMark method (Part II, Chapter 6), test the proposed method (Part III, Chapters 7-12), and summarize findings, experience, and viable future directions (Part IV, Chapter 13).
The thesis describes BootMark, a bootstrapping method for named-entity recognition based on active learning. The BootMark method springs from the tension between users’ information needs and the actual capabilities of an information extraction system, of which a named-entity recognizer is a fundamental component. This tension is created by an opposition: “On the one hand, a specific and unambiguously defined information need is a prerequisite for successful information extraction. On the other hand, this very specificity of the information need definition causes problems in adapting and constructing information extraction systems; any piece of information that falls outside a given definition of an information need will not be recognized by the system, simply because it does not look for such pieces.” (p. 1). Additionally, “the domain and genre to which named entity recognition has been applied has not been varied to a great extent. The data sets used often consists of news wire texts, transcribed broadcast data, or scientific texts.” (p. 14). Last but not least, annotated data are a fundamental building block for machine-learning-based information extraction systems, but obtaining good annotated training data is always challenging. Given these limitations, the BootMark method aims to reduce the number of documents a human has to annotate in order to train a named-entity recognizer. It also explores ways of making marked-up text easier to acquire, so that named-entity recognizers and information extraction systems become more flexible.
Technical contributions include (cf. pp. 6-7):
1. the definition and evaluation of a number of metrics for quantifying the uncertainty of a single learner;
2. the definition and evaluation of a number of metrics for quantifying decision committee disagreement;
3. a way of combining the results from two view classifiers in co-testing;
4. an intrinsic stopping criterion for committee-based active learning;
5. a strategy for deciding whether the predicted label for a given instance (a token in the context of a document) should be suggested as a label to the human annotator during pre-tagging with revision.
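To make items 1 and 2 more concrete, here is a minimal Python sketch of two measures that are common in the active learning literature: the entropy of a single learner's predicted label distribution, and the vote entropy of a committee. Both are assumptions chosen for illustration and are not necessarily the metrics defined in the thesis; the function names and the example labels ("PER", "ORG") are hypothetical.

```python
import math
from collections import Counter


def prediction_entropy(probs):
    """Entropy (in bits) of one learner's label distribution for a token.

    Higher values mean the learner is less certain about the label.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def vote_entropy(votes):
    """Entropy (in bits) of the labels voted for by committee members.

    Higher values mean the committee disagrees more about the label.
    """
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative values only (not taken from the thesis).
confident = prediction_entropy([1.0, 0.0])       # learner is sure
uncertain = prediction_entropy([0.5, 0.5])       # learner is undecided
agree = vote_entropy(["PER", "PER", "PER", "PER"])   # unanimous committee
split = vote_entropy(["PER", "ORG", "PER", "ORG"])   # evenly split committee
```

In an active learning setting, the tokens or documents with the highest scores would be the ones handed to the human annotator first.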
In Part I, Chapters 2 and 3 introduce named-entity recognition and general machine learning concepts, respectively. Chapter 4 explains active machine learning and Chapter 5 contains a survey of support for annotation processes.
Part II, the core of the thesis, includes only Chapter 6, which introduces and discusses BootMark and its three phases: seeding, selecting documents, and revising. Five issues emerge that will be empirically tested:
1. base learner and task characteristics;
2. the constitution of the seed set;
3. actively selecting documents;
4. monitoring and terminating the learning process;
5. revision of system-suggested annotations.
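The three phases can be pictured as a simple loop. The Python sketch below is only my reading of the outline above, not the implementation in the thesis: every callable passed in (train, doc_uncertainty, annotate, revise, should_stop) is a hypothetical placeholder, and the toy values at the end exist only to make the sketch runnable.

```python
def bootmark_sketch(corpus, seed_size, train, doc_uncertainty,
                    annotate, revise, should_stop):
    """Hypothetical outline of seeding, selecting documents, and revising."""
    # Phase 1 (seeding): a human annotates an initial set of documents.
    labeled = [annotate(doc) for doc in corpus[:seed_size]]
    unlabeled = list(corpus[seed_size:])
    model = train(labeled)

    # Phase 2 (selecting documents): repeatedly pick the document the
    # current model is least sure about, have it annotated, and retrain.
    while unlabeled and not should_stop(model, labeled):
        doc = max(unlabeled, key=lambda d: doc_uncertainty(model, d))
        unlabeled.remove(doc)
        labeled.append(annotate(doc))
        model = train(labeled)

    # Phase 3 (revising): the model pre-tags the remaining documents and
    # a human revises the suggested annotations.
    for doc in unlabeled:
        labeled.append(revise(doc, model(doc)))
    return labeled


# Toy stand-ins, purely illustrative.
result = bootmark_sketch(
    corpus=["a", "bb", "ccc", "dddd"],
    seed_size=1,
    train=lambda labeled: (lambda doc: "suggested"),   # model: doc -> tags
    doc_uncertainty=lambda model, doc: len(doc),       # longer = less certain
    annotate=lambda doc: (doc, "gold"),
    revise=lambda doc, suggestion: (doc, "revised"),
    should_stop=lambda model, labeled: len(labeled) >= 2,
)
```

The five issues listed above map onto the arguments: the base learner (train), the seed set (seed_size), document selection (doc_uncertainty), monitoring and termination (should_stop), and revision (revise).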
Part III contains the experiments. More specifically, Chapter 7 introduces the experimental setting. Chapter 8 describes the first set of experiments, which provide a baseline for those that follow. These experiments also cover parameter selection, the use of automatic feature set reduction methods, and, for the best base learner, the generation of learning curves that visualize its ability to learn as more data becomes available. Chapter 9 investigates the applicability of active machine learning to selecting the next document to annotate, based on those that have already been marked up. Chapter 10 addresses the constitution of the document set used to start the bootstrapping process. Chapter 11 examines ways to monitor the active learning process. Chapter 12 discusses the use of the named-entity recognizer learned during the bootstrapping phase for marking up the remainder of the documents in the corpus.
Finally, Part IV ends the dissertation with a summary, conclusions, and future work.
The thesis is mostly descriptive. BootMark is well explained and well documented. Many details are available, with a breakdown of the settings in the Appendix. Chapter 6, with its detailed outline of BootMark and its three phases, is in the limelight, being the only chapter in Part II.
My real curiosity lies in the applicability and effectiveness of active learning for real-world applications. Since the author states that the experiments are to be considered indicative of the plausibility of the BootMark method (p. 5), I wonder whether BootMark (or part of it) has been used as a component in Ethersource after Olsson joined Gavagai as Chief Data Officer. In one of Gavagai’s blog posts, namely “We don’t do training, we do learning”, active learning is not mentioned, but it would be interesting to know in which way “Learning is done on the fly”…
- “The biggest challenge with Big Data is to stop focusing on Big Data”
- Swedish Startup Gavagai – exclusive interview with Jussi Karlgren
- In Swedish: Big Data