Abstract
Automatic text classification has gained importance with the increased availability of text data and the need for faster and more accurate extraction of knowledge from huge data sources. A challenging task in text classification is the effective representation of text. The features that are used to represent the document affect the performance of text classification. The traditional vector space model representation, based on the term independence assumption, considers a document as an unordered set of terms and their frequency-based weights. Although it is simple and fast, this representation does not consider structural information (order of words, relationship between words) or the semantics of text. An alternative to the vector space model for representing documents is structure-based representation. The different structure-based representations of text are sequences, trees and graphs. A document represented as a graph instead of a vector can retain its inherent structure, which can improve classification performance. The major drawback of graph-based approaches for text classification is the increase in computational complexity. The goal of this research project is to increase the effectiveness and efficiency of text classification through enhanced graph-based representations, term weighting schemes and classification models.
This thesis presents graph-based methods to represent and utilise the structural information in text documents. This involves the study of graph-based text representation models to capture the structural information in text and different methods to utilise this rich information for text classification.
Initially, a supervised graph-based term weighting scheme is developed that considerably improves the effectiveness of text classification. A graph-based framework for text classification is then introduced that represents each class as a graph (class graph) in order to utilise the structural information in the labelled training documents. Efficient graph-theoretic techniques such as network centrality measures and graph-decomposition techniques are used in the proposed text classification framework for supervised term weighting and also for graph reduction that eliminates the irrelevant terms. Structured regularization incorporates structural information into the learning process and reduces overfitting. Structure-based supervised term weighting is therefore combined with structured regularization, so that structural information informs both the term weights and the regularizer. The semantic similarity and the co-occurrence information in the class graphs are utilised to identify topics for structured regularization.
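The class-graph idea can be illustrated with a small sketch. The abstract does not specify how the graphs are built or which centrality measure is used, so everything below is an assumption made for illustration: terms that co-occur within a sliding window are linked, one graph is built per class label, and normalised degree centrality serves as the term weight. The function names and the `window` parameter are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def build_cooccurrence_graph(documents, window=2):
    """Build an undirected term graph (assumed construction).

    Each term is a node; two terms are linked when one appears within
    `window` tokens after the other in any document.
    """
    graph = {}
    for tokens in documents:
        for i, term in enumerate(tokens):
            # Register the term even if it ends up with no neighbours.
            graph.setdefault(term, set())
            for other in tokens[i + 1 : i + 1 + window]:
                if other != term:
                    graph[term].add(other)
                    graph.setdefault(other, set()).add(term)
    return graph


def degree_centrality(graph):
    """Normalised degree centrality: neighbours / (number of nodes - 1)."""
    n = len(graph)
    if n <= 1:
        return {term: 0.0 for term in graph}
    return {term: len(neighbours) / (n - 1) for term, neighbours in graph.items()}


def class_term_weights(labelled_docs, window=2):
    """Group tokenised documents by label, build one graph per class,
    and return {label: {term: centrality-based weight}}."""
    by_class = defaultdict(list)
    for tokens, label in labelled_docs:
        by_class[label].append(tokens)
    return {
        label: degree_centrality(build_cooccurrence_graph(docs, window))
        for label, docs in by_class.items()
    }


if __name__ == "__main__":
    # Toy labelled corpus (made up for illustration).
    training = [
        (["stock", "market"], "finance"),
        (["stock", "price"], "finance"),
        (["team", "match"], "sport"),
    ]
    for label, weights in class_term_weights(training, window=1).items():
        print(label, weights)
```

In this sketch a term that co-occurs with many distinct terms of a class receives a high weight for that class. Low-weight terms could then be dropped as a simple form of graph reduction, although the thesis may use different centrality or decomposition techniques.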
A graph kernel-based approach for text classification is presented that focusses on building an effective semantic representation of text. A novel method is developed to automatically enrich the graphs using a word similarity matrix so that the similarity measure goes beyond exact matching of terms and relationships. Medical text documents contain complex terminology, and handling these terms is essential for understanding the semantics of the documents and classifying them accurately. The proposed graph enrichment method is therefore applied to build weighted concept graphs automatically from medical text documents. The graph-based approaches to text classification introduced in this thesis increase text classification performance and consistently outperform the baseline methods on benchmark datasets, including term frequency-based methods, state-of-the-art graph-based techniques, CNN and fastText.
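To make the enrichment idea concrete, the following is a minimal sketch and not the method of the thesis. It assumes graphs stored as dictionaries of weighted undirected edges, a word similarity lookup given as nested dictionaries, a hypothetical `threshold` parameter, and a very simple edge-matching kernel. The similarity scores are invented for the example.

```python
def edge(u, v):
    """Canonical key for an undirected edge."""
    return tuple(sorted((u, v)))


def enrich(edges, similarity, threshold=0.5):
    """Add edges between terms similar to each endpoint (assumed scheme).

    edges:      {edge(u, v): weight}
    similarity: {term: {other_term: score in [0, 1]}}
    A new edge (u2, v2) gets weight w * sim(u, u2) * sim(v, v2).
    """
    def similar(term):
        # A term always matches itself with score 1.0.
        matches = {term: 1.0}
        for other, score in similarity.get(term, {}).items():
            if score >= threshold:
                matches[other] = score
        return matches

    enriched = dict(edges)
    for (u, v), w in edges.items():
        for u2, su in similar(u).items():
            for v2, sv in similar(v).items():
                if u2 == v2:
                    continue  # skip self-loops
                key = edge(u2, v2)
                enriched[key] = max(enriched.get(key, 0.0), w * su * sv)
    return enriched


def edge_kernel(g1, g2):
    """Sum of weight products over edges present in both graphs."""
    return sum(w * g2[e] for e, w in g1.items() if e in g2)


if __name__ == "__main__":
    # Hypothetical similarity scores between two medical terms.
    sim = {"heart": {"cardiac": 0.8}, "cardiac": {"heart": 0.8}}
    doc_a = {edge("heart", "attack"): 1.0}
    doc_b = {edge("cardiac", "attack"): 1.0}

    print(edge_kernel(doc_a, doc_b))  # exact matching only
    print(edge_kernel(enrich(doc_a, sim), enrich(doc_b, sim)))
```

Without enrichment the two toy documents share no edge, so an exact-match kernel treats them as unrelated. After enrichment, the related terms contribute a non-zero similarity. This is the behaviour the paragraph above describes, shown here under the stated assumptions.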
Date of Award | Aug 2020 |
---|---|
Original language | English |
Sponsors | Vice Chancellor's Research Scholarship (VCRS) |
Supervisor | Zhiwei Lin (Supervisor), Glenn Hawe (Supervisor) & Hui Wang (Supervisor) |
Keywords
- Graphs
- Machine Learning
- Text Representation
- Text Classification