In our approach we used the MGC corpus. This corpus was gathered from the Internet and consists of 1539 English web pages classified into 20 genres, as shown in the following table. Each web page was assigned primary, secondary and final genres by labelers. Of the 1539 web pages, 1059 are labeled with one genre, 438 with two genres, 39 with three genres and 3 with four genres. The following table also shows that the MGC corpus is unbalanced: the web pages are not equally distributed among the genres.
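The label counts above can be checked with a few lines of arithmetic. The sketch below uses only the four counts stated in the text. The derived quantities (label cardinality, meaning the average number of genres per page, and the share of multi-label pages) are standard multi-label statistics that we compute here for illustration. The variable names are ours.

```python
# Number of pages carrying exactly k genre labels, as stated in the text.
pages_by_label_count = {1: 1059, 2: 438, 3: 39, 4: 3}

# Total number of pages; should match the corpus size of 1539.
total_pages = sum(pages_by_label_count.values())

# Total number of (page, genre) assignments in the corpus.
total_labels = sum(k * n for k, n in pages_by_label_count.items())

# Label cardinality: average number of genres assigned to a page.
label_cardinality = total_labels / total_pages

# Share of pages that carry more than one genre.
multi_label_share = 1 - pages_by_label_count[1] / total_pages

print(f"pages={total_pages}, labels={total_labels}")
print(f"label cardinality={label_cardinality:.3f}")
print(f"multi-label share={multi_label_share:.3f}")
```

The counts sum to the stated corpus size, and roughly one page in three carries more than one genre, which is why a multi-label setting is appropriate for this corpus.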
This section discusses the results of two experiments. The first experiment identifies the data source for which our approach achieves the best performance, whereas the second shows the importance of the incremental aspect in genre classification.
In this experiment we varied the percentage of training web pages from 10% to 90% in steps of 10%. Table 3 shows that our approach requires only a small set of training pages to achieve good results. This supports the claim that incremental classification of web pages leads to better results than batch classification. Moreover, the best results are obtained with adjusted centroids rather than stable centroids.
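The difference between stable and adjusted centroids can be illustrated with a minimal sketch. It is not the classifier evaluated here. It assumes that each genre is represented by the mean feature vector of its training pages, that a new page is assigned to the single nearest centroid (Euclidean distance, one label per page for brevity), and that "adjusted" means the winning centroid is updated as a running mean after each assignment. The function names `train_centroids` and `classify_stream` are hypothetical.

```python
import math

def train_centroids(pages, labels, n_genres):
    """Build one mean vector per genre from the training pages.

    pages: list of feature vectors; labels: list of sets of genre indices.
    Returns the centroids and the number of pages behind each centroid.
    """
    dim = len(pages[0])
    sums = [[0.0] * dim for _ in range(n_genres)]
    counts = [0] * n_genres
    for x, genres in zip(pages, labels):
        for g in genres:
            counts[g] += 1
            for j, v in enumerate(x):
                sums[g][j] += v
    # max(c, 1) avoids a division by zero for genres without training pages.
    centroids = [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]
    return centroids, counts

def classify_stream(stream, centroids, counts, adjust=True):
    """Classify pages one at a time; optionally adjust the winning centroid.

    Assumption: nearest-centroid assignment with a single predicted genre.
    """
    centroids = [row[:] for row in centroids]  # work on copies
    counts = counts[:]
    predictions = []
    for x in stream:
        g = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i]))
        predictions.append(g)
        if adjust:
            # Running-mean update: the centroid moves toward the new page.
            counts[g] += 1
            centroids[g] = [c + (v - c) / counts[g] for c, v in zip(centroids[g], x)]
    return predictions, centroids

# Training fractions used in the experiment: 10% to 90% in steps of 10%.
training_fractions = [p / 100 for p in range(10, 100, 10)]

# Toy data (hypothetical): two genres in a two-dimensional feature space.
pages = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
labels = [{0}, {0}, {1}]
centroids, counts = train_centroids(pages, labels, n_genres=2)
stream = [[4.0, 0.0]]
pred_adjusted, adjusted = classify_stream(stream, centroids, counts, adjust=True)
pred_stable, stable = classify_stream(stream, centroids, counts, adjust=False)
```

With stable centroids the model is frozen after training. With adjusted centroids every classified page refines the model, which is one plausible reason why a small initial training set can suffice.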
In this section, we compare our approach with the other multi-label classification methods presented in Section 3 and implemented in the Mulan toolkit¹. These algorithms are Rakel, BR-SVM, MLKNN and BPMLL. To provide a valid comparison, we compared the algorithms using the corrected resampled paired t-test at a significance level of 5%. Table 4 shows that our classifier outperforms all of these multi-label classification algorithms.
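For readers unfamiliar with the test, the following sketch computes the corrected resampled paired t statistic in the form commonly attributed to Nadeau and Bengio: the variance of the per-run score differences is scaled by 1/k + n_test/n_train instead of 1/k, which compensates for the overlap between resampled training sets. The function name and the example numbers are hypothetical and are not taken from Table 4.

```python
import math
import statistics

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled paired t statistic.

    diffs: per-run differences in a score between two classifiers,
           one value per train/test resampling.
    n_train, n_test: number of training and test instances in each run.
    Returns the t statistic and its degrees of freedom (k - 1).
    """
    k = len(diffs)
    if k < 2:
        raise ValueError("at least two runs are required")
    mean_d = statistics.mean(diffs)
    var_d = statistics.variance(diffs)  # sample variance (divides by k - 1)
    if var_d == 0:
        raise ValueError("differences have zero variance")
    # The n_test / n_train term is the correction for overlapping training sets.
    t = mean_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)
    return t, k - 1

# Hypothetical accuracy differences over five resampling runs.
diffs = [0.02, 0.04, 0.03, 0.05, 0.01]
t_corrected, dof = corrected_resampled_t(diffs, n_train=900, n_test=100)

# Uncorrected paired t statistic, shown only for comparison.
t_plain = statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / len(diffs))
```

The difference is declared significant at the 5% level when the absolute value of the statistic exceeds the two-sided critical value of the t distribution with k - 1 degrees of freedom (for example, 2.262 for k = 10 runs). The corrected statistic is always smaller in magnitude than the uncorrected one, so the test is more conservative.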
Execution speed is another important aspect of the comparison. In general, execution speed is based on training and testing times. In this experiment, we compare our approach with the algorithms used above, namely Rakel, BR-SVM, MLKNN and BPMLL, in terms of training and testing time measured in seconds. The results are presented in Table 5.
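Training and testing times of this kind can be measured separately with a small harness such as the one below. It is a generic sketch, not the measurement code behind Table 5: `time_train_and_test`, `fit` and `predict` are hypothetical names, and the toy classifier only exists to make the example runnable.

```python
import time

def time_train_and_test(fit, predict, train_data, test_data):
    """Measure training and testing time separately, in seconds.

    fit(train_data) must return a model; predict(model, test_data) must
    return the predictions. Both callables are supplied by the caller.
    """
    start = time.perf_counter()
    model = fit(train_data)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = predict(model, test_data)
    test_seconds = time.perf_counter() - start
    return train_seconds, test_seconds, predictions

# Toy classifier (hypothetical): threshold at the mean of the training values.
def fit(values):
    return sum(values) / len(values)

def predict(threshold, values):
    return [v > threshold for v in values]

train_seconds, test_seconds, predictions = time_train_and_test(
    fit, predict, [1.0, 2.0, 3.0], [0.5, 2.5]
)
print(f"training: {train_seconds:.6f} s, testing: {test_seconds:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations. For a fair comparison, every algorithm should be timed on the same machine with the same train/test split.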