Abstract
Most machine learning (ML) algorithms work best when each class contains roughly the same number of samples. When a dataset is imbalanced, however, an ML model can achieve high accuracy simply by predicting the majority class, and thereby appear to perform well as a classifier. Resampling methods address this imbalance by creating synthetic data in the minority class or removing samples from the majority class until the classes are balanced. However, these resampling methods can harm classification performance by adding synthetic samples to a dataset that already suffers from class overlap and other intrinsic data issues. Recently, dataset complexity measures, which describe the characteristics of a dataset, have often been used in machine learning tasks as intrinsic descriptors to estimate the difficulty of a classification problem. This study investigates how resampling methods affect a dataset in terms of complexity, overlap, and classification accuracy. This is achieved by monitoring 22 measurements related to data complexity, obtained from 20 public, pre-processed datasets from the KEEL repository. Our empirical findings demonstrate a strong positive correlation between the resampling methods and the dataset's complexity and classification accuracy. We also posit that the main cause of poor classification accuracy on an imbalanced dataset is not the imbalance itself but the other intrinsic characteristics of the dataset. Finally, we advocate that pre-processing methods and ML algorithms should be chosen based on the dataset's specific properties rather than on an ad hoc basis.
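The accuracy effect and the undersampling idea described in the abstract can be illustrated with a minimal sketch. It is not the study's method: the toy data, the 95/5 class split, and the function names are all hypothetical, and plain random undersampling stands in for whichever resampling methods the paper actually evaluates.

```python
import random
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def majority_baseline_balanced_accuracy(labels):
    """Mean per-class recall of the same always-majority classifier."""
    counts = Counter(labels)
    # Recall is 1.0 for the predicted class and 0.0 for every other class.
    return 1.0 / len(counts)


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical toy data: 95 majority samples (label 0), 5 minority samples (label 1).
labels = [0] * 95 + [1] * 5
samples = list(range(100))

print(majority_baseline_accuracy(labels))           # 0.95
print(majority_baseline_balanced_accuracy(labels))  # 0.5

_, y_bal = random_undersample(samples, labels)
print(Counter(y_bal))  # five samples per class
```

The gap between 0.95 and 0.5 shows why plain accuracy is misleading on imbalanced data. The sketch balances class counts only; it does nothing about overlap between classes, which is the kind of intrinsic property the abstract argues matters more.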
| Original language | English |
|---|---|
| Title of host publication | ICBDE 2024 - 2024 the 7th International Conference on Big Data and Education |
| Pages | 49-56 |
| Number of pages | 8 |
| ISBN (Electronic) | 979-8-4007-1698-0 |
| DOIs | |
| Publication status | Published (in print/issue) - 24 Jan 2025 |
| Event | ICBDE 2024: 2024 the 7th International Conference on Big Data and Education, Oxford, United Kingdom. Duration: 24 Sept 2024 → 26 Sept 2024. https://dl.acm.org/doi/proceedings/10.1145/3704289 |
Publication series
| Name | ICBDE 2024 - 2024 the 7th International Conference on Big Data and Education |
|---|---|
Conference
| Conference | ICBDE 2024: 2024 the 7th International Conference on Big Data and Education |
|---|---|
| Period | 24/09/24 → 26/09/24 |
Bibliographical note
Publisher Copyright: © 2024 Copyright held by the owner/author(s).
Keywords
- Classification
- Dataset Complexity
- Imbalanced Datasets
- Overlapping Instances
- Resampling
Fingerprint
Dive into the research topics of 'Assessing the Effect of Data Complexity and Instance Overlap Issues on Imbalanced Learning'. Together they form a unique fingerprint.
Student theses
- Improving reliability in the internet of things through anomaly detection
  Moore, S. J. (Author), Zhang, S. (Supervisor), Nugent, C. (Supervisor) & Cleland, I. (Supervisor), Sept 2022. Student thesis: Doctoral Thesis