Hjelm, Hans (2009) Cross-language Ontology Learning. Incorporating and Exploiting Cross-language Data in the Ontology Learning Process. Academic dissertation for the Degree of Doctor of Philosophy in Computational Linguistics at Stockholm University, 2009.
Review by Marina Santini
The PhD thesis “Cross-language Ontology Learning” presents a framework for automating cross-language ontology creation and suggests a setting in which cross-language data can be profitably integrated. The high-level task is to automate the acquisition of semantic knowledge.
In Information Science, an ontology is “a way of representing knowledge or structuring the terminology within a domain” (p. 14). The thesis focuses on the learning of domain ontologies and limits its scope to studying is-a hierarchies. Ontology learning is the automated acquisition of a domain terminology from raw natural language texts. That is, “given a collection of texts, the ontology learning task typically consists of first recognizing the relevant objects (words or terms) in the text collection and secondly ordering these into a hierarchical inheritance structure” (p. 17). Automatic ontology creation addresses two drawbacks of current handmade resources (such as WordNet and Roget’s Thesaurus): (1) the lack of coverage and (2) the lack of adaptability and updatability.
The thesis provides three main contributions:
1. the development of a new evaluation measure to assess results and performance (Chapter 4);
2. a comparison of distributional similarity models and a statistical word alignment system for bilingual dictionary extractions, as well as the presentation of an ensemble method that combines the two approaches (Chapter 5);
3. the confirmation that: (1) the combination of various sources of information is effective and (2) cross-language data helps the ontology learning process (Chapter 6).
The thesis has seven chapters.
The opening chapter — Introduction — provides definitions, sets the boundaries of the investigation and explains the challenges.
Chapter 2 — Ontology Learning Perspectives — is a comprehensive overview of different approaches to the problems involved in cross-language ontology learning, ranging from the granularity of term units (single words, multiwords, non-adjacent multiword expressions) to the issues related to translational equivalence for terms.
Chapter 3 — Resources — presents the corpora (a.k.a. text collections) used in the experiments carried out to test the research hypotheses. Experiments are based on two document collections:
- JRC-ACQUIS Multilingual Parallel Corpus: this corpus consists of legal texts concerning matters involving the EU. Among the languages used in the experiments (German, French, English and Swedish), the number of words per language varies between 6.5 million (Swedish) and 7.8 million (French) (pp. 51-52).
- Wikipedia anatomy corpus: Wikipedia pages filed under the ‘Anatomy’ category for English, French, German and Spanish. This resulted in about 7,300 pages for English, 2,600 for French, 2,400 for German and 1,000 for Spanish. The corresponding number of words is about 4.4 million for English, 1.1 million for French, 890,000 for German and 400,000 for Spanish (pp. 52-53).
Evaluation is based on the following gold standard terminological ontologies:
- Eurovoc V4.28: a freely available multilingual thesaurus with entries in more than 20 languages. The thesaurus contains 6,645 concepts (p. 53).
- The Foundational Model of Anatomy (FMA) ontology: an open-source ontology developed by the Structural Informatics Group at the University of Washington. It contains about 100,000 English terms, 8,000 Latin, 4,000 French, 500 Spanish and 300 German terms (p. 55).
Chapter 4 — Theoretical and Experimental Investigations Regarding Evaluation — reviews existing evaluation measures and proposes a new one, based on Pearson’s product-moment correlation coefficient (PMCC). “The idea behind the PMCC measure is that we can characterize an ontology by listing all pairs of concepts that it contains, along with the distance between each concept pair, measured in the number of edges between the two concepts” (p. 65). The author sets out five criteria that a good evaluation measure should satisfy (pp. 67-68):
1. Criterion 1 – Independent dimensions of evaluation;
2. Criterion 2 – Severe errors should have a higher impact on the measure than less severe ones;
3. Criterion 3 – A gradual decrease in correctness should result in a gradual decrease in the value of the evaluation measure;
4. Criterion 4 – Scaling;
5. Criterion 5 – Vertical and horizontal perspective.
Seven experiments are carried out to test the measures (pp. 66-76), and the PMCC measure turns out to be the only one tested that meets all five criteria.
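The idea quoted above (listing all concept pairs with their edge distances and correlating the two lists) can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names (`edge_distances`, `pearson`, `pmcc_score`) and the toy animal taxonomy are assumptions made for illustration, and the sketch assumes that both ontologies are connected hierarchies over the same set of concepts.

```python
from collections import deque
from math import sqrt


def edge_distances(edges, concepts):
    """Map each unordered concept pair to the number of edges between them."""
    graph = {c: set() for c in concepts}
    for child, parent in edges:
        # is-a links are treated as undirected when counting edges
        graph[child].add(parent)
        graph[parent].add(child)
    distances = {}
    for source in concepts:
        seen = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen[neighbour] = seen[node] + 1
                    queue.append(neighbour)
        for target, dist in seen.items():
            if source < target:  # store each pair once
                distances[(source, target)] = dist
    return distances


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient of two lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # undefined if either list is constant


def pmcc_score(gold_edges, learned_edges, concepts):
    """Correlate pairwise edge distances in a gold and a learned ontology."""
    gold = edge_distances(gold_edges, concepts)
    learned = edge_distances(learned_edges, concepts)
    pairs = sorted(gold.keys() & learned.keys())
    return pearson([gold[p] for p in pairs], [learned[p] for p in pairs])


# Hypothetical toy data: (child, parent) is-a links.
concepts = ["animal", "mammal", "bird", "dog", "cat"]
gold_edges = [("mammal", "animal"), ("bird", "animal"),
              ("dog", "mammal"), ("cat", "mammal")]
# The learned hierarchy wrongly files "cat" under "bird".
learned_edges = [("mammal", "animal"), ("bird", "animal"),
                 ("dog", "mammal"), ("cat", "bird")]

perfect = pmcc_score(gold_edges, gold_edges, concepts)
flawed = pmcc_score(gold_edges, learned_edges, concepts)
print(perfect, flawed)
```

In this toy case, an identical hierarchy yields a correlation of 1, while the misplaced concept lowers the score, since several pairwise distances no longer agree with the gold standard.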
In Chapter 5 — Experiments with Identifying Cross-language Term Equivalency — the author investigates methods for automatically identifying translational equivalents among terms from different languages. Given a collection of domain-specific documents, the goal is to identify the textual units that constitute the terms of the domain (i.e. the terminology of the document collection).
Chapter 6 — Experiments in Ontology Learning — focuses on the exploitation of cross-language resources. More specifically, the author emphasizes that “further qualitative improvements are possible, by taking a cross-language perspective on the problem of recognizing hyperonymy and cohyponymy, and that automatic word alignment techniques can be used to, at least partially, achieve these improvements, in the absence of a domain-specific multilingual dictionary” (p. 124). Results appear to be encouraging, “though there is still plenty of qualitative and methodological improvement that can and should be made, before the system can start performing on a human level, in terms of accuracy” (p. 135).
Chapter 7 — Conclusions — is a concise summary that lists research questions and their answers.
The thesis “Cross-language Ontology Learning” is interesting, solid, clear and well-written. The author tackles a difficult and complex task, the automatic creation of cross-language ontologies, and analyses the aspects that contribute to it, as well as the different computational approaches proposed up to 2009.
Remarkably, the author introduces a new evaluation measure, PMCC, and confirms the hypothesis that cross-language data contribute to improving the learning process.
The thesis is inspirational in many respects, from the experiments related to translational equivalence and word alignment (pp. 41-46 and 81-95) to cross-language SVM classification (pp. 118-124), and is recommended reading for those working with ontologies, taxonomies, terminology, machine translation or, more generally, with the acquisition of semantic knowledge.
The bulk of the experiments and contributions presented in Hans Hjelm’s thesis is now included as Chapter 14 of a book edited by Wilson Wong, Wei Liu and Mohammed Bennamoun: Ontology Learning and Knowledge Discovery Using the Web, published by IGI Global, May 31, 2011.