Mining Graph Patterns in Web-based Systems: A Conceptual View
by Matthias Dehmer and Frank Emmert-Streib
Abstract
This chapter discusses a graph-based perspective for automatically analyzing web genre data by mining graph patterns that represent web-based hypertext structures. The major purpose of our contribution is to emphasize that an approach entirely different from the vector space model, which is frequently used in Web mining and related problems, can not only be applied to these problems but is also conceptually more suitable. The graphs in our study are hierarchical and directed and are called generalized trees. Starting from a similarity measure for determining the structural similarity of generalized trees, we discuss some evaluation steps for automatically analyzing web genre data. Finally, connections to applications in Web Structure Mining and Web Usage Mining are indicated.
1 Introduction
The task of applying Data Mining methods to web-based hypertexts is often referred to as Web Mining. In view of the steadily increasing complexity of web data sources and the huge amount of information available online, Web Mining has been an important and fruitful research topic. Generally, Web Mining can be divided into the following categories:
1. Web Content Mining: Web Content Mining provides methods for automatically extracting information from web-based data sources. Important problems are data extraction and data analysis using, e.g., Text Mining methods.
2. Web Structure Mining: Web Structure Mining deals with exploring structural properties of web-based hypertexts, e.g., investigating internal and external link structures of web-based documents or exploring hypertext structure types using graph-based models.
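To make the idea of a graph-based model of link structure concrete, the following minimal Python sketch separates internal from external links and arranges a small site as a hierarchical directed graph. It is an illustration under stated assumptions, not the construction used in this chapter: the function names (`split_links`, `bfs_levels`, `classify_edges`), the toy site, the choice of breadth-first distance from a root page as the hierarchy level, and the edge labels "down", "up", and "across" are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def split_links(page_url, hrefs):
    """Partition absolute link targets into internal (same host) and external."""
    host = urlparse(page_url).netloc
    internal = [h for h in hrefs if urlparse(h).netloc == host]
    external = [h for h in hrefs if urlparse(h).netloc != host]
    return internal, external


def bfs_levels(links, root):
    """Assign every page reachable from root its breadth-first distance to root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in level:
                level[target] = level[page] + 1
                queue.append(target)
    return level


def classify_edges(links, root):
    """Label each link by how it moves through the level hierarchy (assumed labels)."""
    level = bfs_levels(links, root)
    kinds = {}
    for source, targets in links.items():
        if source not in level:
            continue  # page is not reachable from the root, so it has no level
        for target in targets:
            diff = level[target] - level[source]
            if diff > 0:
                kinds[(source, target)] = "down"
            elif diff < 0:
                kinds[(source, target)] = "up"
            else:
                kinds[(source, target)] = "across"
    return kinds


# Hypothetical toy site: each page maps to the pages it links to.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/b"],
    "/b": ["/"],
    "/a1": [],
}

levels = bfs_levels(site, "/")
edge_kinds = classify_edges(site, "/")
internal, external = split_links(
    "http://example.org/index.html",
    ["http://example.org/about.html", "http://other.example.com/"],
)
```

In this toy site, the link from `/a` to `/b` connects two pages on the same level, and the link from `/b` back to `/` points toward the root. Structural features of this kind are lost when a page is reduced to a term vector, which is the motivation for working with graphs directly.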