Assessing the Effect of Data Complexity and Instance Overlap Issues on Imbalanced Learning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Most machine learning (ML) algorithms work best when each class contains roughly the same number of samples. On an imbalanced dataset, however, a model can reach high accuracy simply by always predicting the majority class. Resampling methods address this imbalance by creating synthetic samples in the minority class or by removing samples from the majority class until the classes are balanced. However, these methods can degrade performance when synthetic samples are added to a dataset that already suffers from class overlap and other intrinsic data issues. Dataset complexity measures, which describe a dataset through its observed characteristics, have recently come into frequent use in machine learning as intrinsic descriptors of the difficulty of a classification problem. This study investigates how resampling methods affect a dataset in terms of complexity, overlap, and classification accuracy. To do so, we monitor 22 measurements related to data complexity, obtained from 20 public, pre-processed datasets from the KEEL repository. Our empirical findings demonstrate a strong positive correlation between the resampling methods and the dataset's complexity and classification accuracy. We also posit that the main cause of poor classification accuracy on an imbalanced dataset is not the imbalance itself but the dataset's other intrinsic characteristics. Finally, we advocate that pre-processing methods and ML algorithms be chosen according to the dataset's specific properties rather than on an ad hoc basis.
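The accuracy effect and the two resampling directions described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the abstract does not name the resampling methods studied, so the code below assumes plain random oversampling (duplicating minority samples rather than synthesising new ones) and random undersampling, and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one.

    Assumption: plain random oversampling, not a synthetic method.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for x in rng.sample(pool, target):
            out_x.append(x)
            out_y.append(cls)
    return out_x, out_y


# Hypothetical toy data: 95 majority (class 0) and 5 minority (class 1) samples.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

print(majority_baseline_accuracy(y))   # 0.95, without learning anything

X_over, y_over = random_oversample(X, y)
print(Counter(y_over))                 # both classes now have 95 samples

X_under, y_under = random_undersample(X, y)
print(Counter(y_under))                # both classes now have 5 samples
```

Neither function looks at where the samples lie in feature space, which is why resampling alone leaves any existing class overlap untouched.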
Original language: English
Title of host publication: ICBDE 2024 - 2024 the 7th International Conference on Big Data and Education
Pages: 49-56
Number of pages: 8
ISBN (Electronic): 979-8-4007-1698-0
Publication status: Published (in print/issue) - 24 Jan 2025
Event
ICBDE 2024: 2024 the 7th International Conference on Big Data and Education
Oxford, United Kingdom
Duration: 24 Sept 2024 to 26 Sept 2024
https://dl.acm.org/doi/proceedings/10.1145/3704289

Publication series

Name: ICBDE 2024 - 2024 the 7th International Conference on Big Data and Education

Bibliographical note

Publisher Copyright:
© 2024 Copyright held by the owner/author(s).

Keywords

  • Classification
  • Dataset Complexity
  • Imbalanced Datasets
  • Overlapping Instances
  • Resampling
