Learning-based Data Augmentation for Multiclass Data
General Material Designation
[Thesis]
First Statement of Responsibility
Al Olaimat, Mohammad
Subsequent Statement of Responsibility
Kim, Jinoh
PUBLICATION, DISTRIBUTION, ETC.
Name of Publisher, Distributor, etc.
Texas A&M University - Commerce
Date of Publication, Distribution, etc.
2019
GENERAL NOTES
Text of Note
73 p.
DISSERTATION (THESIS) NOTE
Dissertation or thesis details and type of degree
M.S.
Body granting the degree
Texas A&M University - Commerce
Text preceding or following the note
2019
SUMMARY OR ABSTRACT
Text of Note
A multiclass dataset contains three or more classes for classification, and many classification models have been developed to detect such classes. These models need a training dataset, which sometimes contains many classes. Two important challenges for these models are that (i) some training datasets are not well balanced, because some classes are less represented than others, and (ii) some training datasets are small, making it hard to represent the entire distribution. Training a classification model on an unbalanced or small dataset reduces its ability to detect the minor classes in the testing dataset, because the classifier may treat these classes as noise, which may in turn reduce classification efficiency. This research aims to make such unbalanced datasets representative by augmenting minor classes, and to enlarge small datasets, in order to improve classification performance. In particular, it takes a deep learning approach to augmentation. Generative adversarial networks (GANs) are one such deep learning technique, and this research develops a methodology for extending the data effectively using GANs. The methodology consists of the following steps: generating synthetic data for the minor class using GANs based on a locality augmentation strategy, augmenting the synthetic data, training a classification model on the augmented dataset, and testing the classification model. In an evaluation using public network connection datasets, we observed that the proposed technique improves the identification of network anomalies by up to 9% in classifier accuracy and 10% in F1-score when the minor classes represent 3% of all classes in the training dataset.
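The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the 20% target share, and the class labels are hypothetical, and `jitter_generator` is only a stand-in that adds Gaussian noise around real minor-class samples where the thesis uses a trained GAN. The sketch also assumes that the 3% figure refers to the minor class's share of training samples.

```python
import math
import random
from collections import Counter


def class_shares(labels):
    """Return each class's fraction of the dataset."""
    counts = Counter(labels)
    return {c: n / len(labels) for c, n in counts.items()}


def jitter_generator(samples, n_new, scale=0.05, rng=None):
    """Hypothetical stand-in for a trained GAN generator (NOT a GAN).

    Each synthetic point is a randomly chosen real sample plus small
    Gaussian noise, so new points stay close to existing ones.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        synthetic.append([x + rng.gauss(0.0, scale) for x in base])
    return synthetic


def augment_minor_class(features, labels, minor, target_share,
                        generate=jitter_generator):
    """Add synthetic minor-class rows until the class reaches target_share."""
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    minor_rows = [f for f, y in zip(features, labels) if y == minor]
    if not minor_rows:
        raise ValueError("no samples of the minor class to learn from")
    n_total, n_minor = len(labels), len(minor_rows)
    # Smallest k with (n_minor + k) / (n_total + k) >= target_share.
    k = 0
    if n_minor / n_total < target_share:
        k = math.ceil((target_share * n_total - n_minor) / (1.0 - target_share))
    synthetic = generate(minor_rows, k)
    return features + synthetic, labels + [minor] * k


# Toy three-class dataset in which the minor class is 3% of the samples.
rng = random.Random(42)
labels = ["normal"] * 80 + ["dos"] * 17 + ["probe"] * 3
features = [[rng.random(), rng.random()] for _ in labels]

aug_features, aug_labels = augment_minor_class(features, labels, "probe", 0.20)
print(class_shares(labels)["probe"], class_shares(aug_labels)["probe"])
```

The augmented lists would then be passed to whatever classifier is being trained, while the test set is left untouched so that accuracy and F1-score are measured on real data only.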