Dissemination: Multi-Labeling Web Pages by Genre

Excerpts from: Chaker Jebari. MLICC: A Multi-Label and Incremental Centroid-Based Classification of Web Pages by Genre. NLDB 2012: 183-190. For the full version, please contact: jebarichaker@yahoo.fr


In our approach we used the MGC corpus. This corpus was gathered from the internet and consists of 1539 English web pages classified into 20 genres, as shown in the following table. Each web page in the corpus was assigned by labelers to primary, secondary and final genres. Of the 1539 web pages, 1059 are labeled with one genre, 438 with two genres, 39 with three genres and 3 with four genres. It is clear from the following table that the MGC corpus is unbalanced, meaning that the web pages are not equally distributed among the genres.
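
The label counts quoted above are enough to derive two common summary statistics for a multi-label corpus, label cardinality and label density. These figures are not reported in the excerpt; the sketch below only recomputes them from the numbers given, and the variable names are illustrative.

```python
# Distribution reported for the MGC corpus:
# number of genres assigned to a page -> number of such pages.
pages_by_label_count = {1: 1059, 2: 438, 3: 39, 4: 3}
num_genres = 20

total_pages = sum(pages_by_label_count.values())
total_labels = sum(k * n for k, n in pages_by_label_count.items())

# Label cardinality: average number of genres per page.
label_cardinality = total_labels / total_pages

# Label density: cardinality normalised by the number of genres.
label_density = label_cardinality / num_genres

# Share of pages that carry more than one genre.
multi_label_share = 1 - pages_by_label_count[1] / total_pages

print(f"pages={total_pages} labels={total_labels}")
print(f"cardinality={label_cardinality:.3f} density={label_density:.3f}")
print(f"multi-label share={multi_label_share:.1%}")
```

Roughly a third of the pages carry more than one genre, which is the motivation for treating the task as multi-label rather than single-label.
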
In this paper we used the average precision (Precision), ranking loss (RankLoss), one-error (OneError) and Hamming loss (HamLoss) metrics (Read, Pfahringer, & Holmes, 2008). In our experiments, we followed the 10 × 10 cross-validation procedure, which consists of randomly splitting the corpus into 10 equal parts, then using 9 parts for training and the remaining part for testing. This process is performed 10 times and the final performance is the average of the 10 individual performances.
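
For readers unfamiliar with these four metrics, the following is a minimal sketch using their usual textbook definitions. It is not the evaluation code used in the paper. It assumes that genres are indexed 0..L-1, that every page has at least one true genre, and that ties in the ranking loss count as errors; the toy data at the end is invented for illustration.

```python
def hamming_loss(true_sets, pred_sets, num_labels):
    """Fraction of (page, genre) slots where prediction and truth disagree."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * num_labels)


def one_error(true_sets, scores):
    """Fraction of pages whose top-scored genre is not a true genre."""
    misses = 0
    for t, s in zip(true_sets, scores):
        top = max(range(len(s)), key=s.__getitem__)
        misses += (top not in t)
    return misses / len(true_sets)


def ranking_loss(true_sets, scores):
    """Average fraction of (true, false) genre pairs ranked in the wrong order."""
    total = 0.0
    for t, s in zip(true_sets, scores):
        irrelevant = [j for j in range(len(s)) if j not in t]
        if not t or not irrelevant:
            continue  # no pairs to compare; contributes 0
        bad = sum(1 for i in t for j in irrelevant if s[i] <= s[j])
        total += bad / (len(t) * len(irrelevant))
    return total / len(true_sets)


def average_precision(true_sets, scores):
    """Mean over pages of the precision measured at each true genre's rank."""
    total = 0.0
    for t, s in zip(true_sets, scores):
        ranking = sorted(range(len(s)), key=lambda j: -s[j])
        hits, precision_sum = 0, 0.0
        for rank, label in enumerate(ranking, start=1):
            if label in t:
                hits += 1
                precision_sum += hits / rank
        total += precision_sum / len(t)  # assumes t is non-empty
    return total / len(true_sets)


# Invented toy data: 2 pages, 3 genres.
true_sets = [{0}, {1, 2}]
scores = [[0.9, 0.2, 0.1], [0.6, 0.5, 0.3]]
pred_sets = [{0}, {0, 1}]  # genres whose score exceeds 0.4

print(hamming_loss(true_sets, pred_sets, num_labels=3))
print(one_error(true_sets, scores))
print(ranking_loss(true_sets, scores))
print(average_precision(true_sets, scores))
```

Higher is better for average precision; lower is better for the other three.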

Experiments and results
This section discusses the results provided by two experiments. The purpose of the first experiment is to identify the data source for which our approach achieves the best performance, whereas the objective of the second experiment is to show the importance of the incremental aspect in genre classification.

Experiment 1: Effect of data source
In this experiment we compared the classification performance using different data sources: URL text (UT), title text (TT), heading text (HT), anchor text (AT), the combination of all the previous sources (CT) and the entire text of the web page (ET). It can be observed from Table 2 that the results obtained using the combination of character n-grams extracted from all data sources outperform those achieved using character n-grams extracted from any single data source. Moreover, using character n-grams extracted from the entire text of the web page, we achieved worse results.
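
As an illustration of what a combined source (CT) might look like, here is a minimal sketch that extracts character n-grams from each data source and merges the counts. The n-gram length (3), the lower-casing, the plain summing of counts, and the sample page are all assumptions made for the example; the excerpt does not specify how the paper builds or weights its features.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def combined_profile(sources, n=3):
    """Merge the n-gram counts of every data source into one profile (CT)."""
    profile = Counter()
    for text in sources.values():
        profile.update(char_ngrams(text, n))
    return profile


# Hypothetical page, keyed by the source abbreviations used in the text.
page = {
    "UT": "http://example.org/faq/shipping",
    "TT": "Shipping FAQ",
    "HT": "Frequently asked questions",
    "AT": "Contact us",
}

for name, text in page.items():
    print(name, char_ngrams(text).most_common(3))
print("CT", combined_profile(page).most_common(3))
```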

Experiment 2: Effect of incremental classification
In this experiment we varied the percentage of training web pages from 10% to 90% in steps of 10%. Table 3 shows that our approach requires only a small set of training pages to achieve good results. This suggests that incremental classification of web pages leads to better results than batch classification. Moreover, the best results are obtained with adjusted centroids rather than stable centroids.
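
The excerpt does not describe the algorithm itself, so the following is only a rough sketch of what a multi-label, incremental centroid classifier could look like, not the MLICC method. It assumes cosine similarity, a fixed similarity threshold for assigning genres, and a running-mean update for "adjusted" centroids; the class name `CentroidClassifier` and all parameters are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidClassifier:
    """Hypothetical sketch: one centroid per genre, optionally adjusted online."""

    def __init__(self, threshold=0.5, adjust=True):
        self.threshold = threshold  # assumed decision rule, not from the paper
        self.adjust = adjust        # True: adjusted centroids, False: stable
        self.centroids = {}         # genre -> mean feature vector
        self.counts = {}            # genre -> pages folded into the centroid

    def _add(self, genre, vec):
        """Fold one page into a genre centroid as a running mean."""
        n = self.counts.get(genre, 0)
        old = self.centroids.get(genre, {})
        keys = set(old) | set(vec)
        self.centroids[genre] = {
            k: (old.get(k, 0.0) * n + vec.get(k, 0.0)) / (n + 1) for k in keys
        }
        self.counts[genre] = n + 1

    def train(self, vec, genres):
        for genre in genres:
            self._add(genre, vec)

    def classify(self, vec):
        """Return every genre whose centroid is similar enough to the page."""
        sims = {g: cosine(vec, c) for g, c in self.centroids.items()}
        labels = {g for g, s in sims.items() if s >= self.threshold}
        if not labels and sims:
            labels = {max(sims, key=sims.get)}  # fall back to the best genre
        if self.adjust:
            for genre in labels:
                self._add(genre, vec)  # the classified page updates the centroid
        return labels


clf = CentroidClassifier(threshold=0.5, adjust=True)
clf.train({"faq": 1.0}, ["faq"])
clf.train({"blo": 1.0}, ["blog"])
print(clf.classify({"faq": 1.0, "blo": 1.0}))
```

With `adjust=False` the centroids stay fixed after training, which corresponds to the "stable" setting in this sketch.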

Comparison with other multi-label classifiers
In this section, we compare our approach with the other multi-label classification methods presented in section 3 and implemented in the Mulan toolkit. These algorithms are Rakel, BR-SVM, MLKNN and BPMLL. To provide a valid comparison, we compared the algorithms using the corrected resampled paired t-test at a significance level of 5%. Table 4 shows that our classifier outperforms all of these multi-label classification algorithms.
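
For reference, the corrected resampled paired t-test replaces the usual 1/k variance factor with 1/k + n_test/n_train, to account for the overlap between training sets across runs. The sketch below computes only the test statistic, under the assumption that `diffs` holds the per-run score differences between two classifiers; the function name and the sample numbers are invented.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """t statistic of the corrected resampled paired t-test.

    diffs   : per-run differences in a metric between two classifiers
    n_train : number of training pages in each run
    n_test  : number of test pages in each run
    """
    k = len(diffs)
    m = mean(diffs)
    var = variance(diffs)  # sample variance, k - 1 in the denominator
    if var == 0:
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return m / math.sqrt((1.0 / k + n_test / n_train) * var)


# Invented differences from 10 runs with a 9:1 train/test split.
diffs = [0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.02, 0.01, 0.03, 0.02]
t = corrected_resampled_t(diffs, n_train=9, n_test=1)
print(f"t = {t:.3f} with {len(diffs) - 1} degrees of freedom")
```

The statistic is then compared with the critical value of Student's t distribution with k - 1 degrees of freedom at the chosen significance level.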

Execution speed is another important aspect of the comparison. Generally, execution speed is based on training and testing times. In this experiment, we compare our approach with the algorithms used above (Rakel, BR-SVM, MLKNN and BPMLL) in terms of training and testing time, measured in seconds. The results are presented in Table 5.
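
Training and testing times of this kind can be measured with a small harness like the one below. This is a generic sketch, not the measurement setup used in the paper; `train_fn` and `test_fn` are hypothetical callables standing in for any classifier.

```python
import time


def time_classifier(train_fn, test_fn, train_data, test_data):
    """Return (training seconds, testing seconds, predictions)."""
    start = time.perf_counter()
    model = train_fn(train_data)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = [test_fn(model, item) for item in test_data]
    test_seconds = time.perf_counter() - start

    return train_seconds, test_seconds, predictions


# Toy stand-in classifier: remembers the set of labels seen in training.
def toy_train(data):
    return {label for _, label in data}


def toy_test(model, item):
    return item in model


train_s, test_s, preds = time_classifier(
    toy_train, toy_test, [("page1", "faq"), ("page2", "blog")], ["faq", "shop"]
)
print(f"train {train_s:.6f}s, test {test_s:.6f}s, predictions {preds}")
```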

It is clear from Table 5 that our approach is the fastest. This result is expected because we exploit only character n-grams extracted from specific web page elements, such as the title, headings, links and URL, rather than the entire text of the web page. The method that takes longest to train is BPMLL, because it applies the most complicated transformation to the training web pages.
Chaker Jebari’s Publications
Related article
Marina Santini : Zero, single, or multi? Genre of web pages through the users’ perspective. Inf. Process. Manage. (IPM) 44(2):702-737 (2008)


1 comment for “Dissemination: Multi-Labeling Web Pages by Genre”

  1. Marina Santini
    26 September, 2012 at 09:48

    Discussion on Enterprise Architecture: Tactical. Strategic. Visionary. — LinkedIn Group

    Jan Jasik • Who does genre classification? Does it follow its own ontology or it is a result of mean, medium or intentions calculation?

    Jan Jasik • Does multi-labeling require a conscious observer?

    Marina Santini • The current approach to automatic genre classification is to:

    1) create a corpus (i.e. a document collection) where each document has been genre-annotated by humans

    2) apply single-label machine learning algorithms to the genre corpus to create a computational model capable of applying genre labels to unclassified documents automatically (one genre per document)

    Unfortunately, a single document does not NECESSARILY belong to a single genre. Many documents belong to many genres simultaneously (e.g. a newsletter can belong to the following genres: newsletter, schedule, summary, list, etc. according to different human raters). For this reason, it would be ideal to have a computational model that can apply as many genre labels as needed to a single document.

    Cheers, Marina

    Jan Jasik • So, multi-labeling requires a conscious observer and even then labeling can be subjective… different observer, more inclusive context, historical perspective… it can get complicated as dynamic labeling: intentions, intelligence, visual representation modeling…?

    Marina Santini • Intriguing, I would say 🙂
