Semi-supervised document classification using ontologies
Aparicio Carrasco, Roxana K.
Many modern applications of automatic document classification require learning accurately from little training data. Semi-supervised classification has been proposed to reduce the need for manual labeling. This technique uses both labeled and unlabeled data for training and has been shown to be effective in many cases. However, the use of unlabeled data for training is not always beneficial, and it is difficult to know a priori when it will work for a particular document collection. On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. Ontologies are formal, explicit, detailed structures of concepts. In this thesis, we propose the use of ontologies to improve automatic document classification when little training data is available. We propose that using ontologies to assist semi-supervised document classification can substantially improve the accuracy and efficiency of the semi-supervised technique. Many learning algorithms have been studied for text. One of the most effective is the Support Vector Machine, which is the basis of this work. Our algorithm enhances the performance of Transductive Support Vector Machines through the use of ontologies. We report experimental results from applying our algorithm to three real-world text classification datasets. The results show an accuracy increase of 4% on average and up to 20% for some datasets, compared with the traditional semi-supervised model.