Prediction performance improvement for highly imbalanced monitoring data

Yuhua Li, Liam Maguire, Michael McCann, Adrian Johnston

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Abstract

In engineering applications, we often face highly imbalanced data problems in which the majority of the data come from one condition and a small minority come from others. Training a classifier directly on such data tends to bias it towards the majority class, resulting in poor prediction on the minority class. This paper proposes a method for balancing training data by over-sampling the minority class. The method uses between-class and within-class information to determine the vicinity space of an example. It generates synthetic examples along orthogonal directions in the vicinity, so that the synthetic examples represent the entire vicinity space well and are more similar to the minority class than to the majority class. The method is easy to use because it requires no parameter setting. A real-world problem involving semiconductor manufacturing line monitoring and process control data is used to demonstrate that classification performance can be significantly improved by learning on data balanced by the proposed method.
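
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes (hypothetically) that the vicinity of a minority example is a ball whose radius is the smaller of half the distance to the nearest majority example (between-class information) and the distance to the nearest other minority example (within-class information), and that the "orthogonal directions" are the coordinate axes. The names `nearest_distance` and `oversample_minority` are illustrative, and the `seed` argument exists only to make the demo repeatable.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to the closest element of `others`."""
    return min(math.dist(point, other) for other in others)


def oversample_minority(minority, majority, seed=0):
    """Return synthetic minority points that balance the two classes.

    Assumption (not from the paper): each synthetic point is its seed
    example shifted along one coordinate axis by a step no larger than
    the seed's vicinity radius.
    """
    rng = random.Random(seed)
    dim = len(minority[0])
    n_needed = max(len(majority) - len(minority), 0)

    # Vicinity radius per minority example (assumed definition).
    radii = []
    for i, x in enumerate(minority):
        between = 0.5 * nearest_distance(x, majority)
        rest = minority[:i] + minority[i + 1:]
        within = nearest_distance(x, rest) if rest else between
        radii.append(min(between, within))

    synthetic = []
    for k in range(n_needed):
        i = k % len(minority)                # cycle through seed examples
        axis = (k // len(minority)) % dim    # cycle through orthogonal axes
        point = list(minority[i])
        point[axis] += rng.uniform(-radii[i], radii[i])
        synthetic.append(point)
    return synthetic


# Toy data: 2 minority examples against 6 majority examples.
minority = [[0.0, 0.0], [1.0, 0.5]]
majority = [[4.0, 4.0], [5.0, 3.0], [3.0, 5.0],
            [6.0, 6.0], [4.0, 6.0], [5.0, 5.0]]
synthetic = oversample_minority(minority, majority)
```

Because each step is at most half the distance from the seed to its nearest majority example, every synthetic point in this sketch lies at least as close to its seed as to any majority example, which mirrors the abstract's requirement that synthetic examples be more similar to the minority class than to the majority class.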
Language: English
Title of host publication: Unknown Host Publication
Number of pages: 8
Publication status: Published - 22 Jun 2010
Event: The 7th International Conference on Condition Monitoring and Machinery Failure Prevention Technologies - Stratford-upon-Avon
Duration: 22 Jun 2010 → …


Fingerprint

Monitoring
Process control
Classifiers
Semiconductor materials

Cite this

Li, Y., Maguire, L., McCann, M., & Johnston, A. (2010). Prediction performance improvement for highly imbalanced monitoring data. In Unknown Host Publication
@inproceedings{45ec745fd6c84b3db3bcfbba3adaf203,
title = "Prediction performance improvement for highly imbalanced monitoring data",
author = "Yuhua Li and Liam Maguire and Michael McCann and Adrian Johnston",
year = "2010",
month = "6",
day = "22",
language = "English",
isbn = "978-1-901892-33-8",
booktitle = "Unknown Host Publication",

}

