Abstract
Large-scale companies across various sectors maintain substantial IT infrastructure to
support their operations and provide quality services for their customers and employees. These IT
operations are managed by teams who deal directly with incident reports (i.e., those generated automatically
by autonomous systems or submitted by human operators). (1) Background: Early identification of
major incidents can provide a significant advantage for reducing the disruption to normal business
operations, especially for preventing catastrophic disruptions, such as a complete system shutdown.
(2) Methods: This study conducted an empirical analysis of eleven (11) state-of-the-art models to
predict the severity of these incidents using an industry-led use-case composed of 500,000 records
collected over one year. (3) Results: The datasets were generated from three stakeholders (i.e., agency,
customer, and employee). Separately, the bidirectional encoder representations from transformers
(BERT), the robustly optimized BERT pre-training approach (RoBERTa), the enhanced representation
through knowledge integration (ERNIE 2.0), and the extreme gradient boosting (XGBoost) methods
performed the best for the agency records (93% AUC), while the convolutional neural network
(CNN) was the best model for the rest (employee records at 95% AUC and customer records at
74% AUC, respectively). The average prediction horizon was approximately 150 min, which was
significant for real-time deployment. (4) Conclusions: The study provided a comprehensive analysis
that supported the deployment of artificial intelligence for IT operations (AIOps), specifically for
incident management within large-scale organizations.
Original language | English |
---|---|
Article number | 3843 |
Pages (from-to) | 1-27 |
Number of pages | 27 |
Journal | Applied Sciences |
Volume | 13 |
Issue number | 6 |
Early online date | 17 Mar 2023 |
Publication status | Published (in print/issue) - 17 Mar 2023 |
Bibliographical note
Funding Information: We are grateful for access to the tier 2 high-performance computing resources provided by the Northern Ireland High-Performance Computing (NI-HPC) facility, funded by the U.K. Engineering and Physical Sciences Research Council (EPSRC), Grant Nos. EP/T022175/ and EP/W03204X/1. Damien Coyle is supported by the UKRI Turing AI Fellowship 2021–2025 funded by the EPSRC (grant number EP/V025724/1). Salman Ahmed is supported by a George Moore Ph.D. scholarship.
Funding Information:
This work was supported by U.K. Research and Innovation Turing AI Fellowship 2021–2025 funded by the Engineering and Physical Sciences Research Council (grant number EP/V025724/1).
Publisher Copyright:
© 2023 by the authors.
Keywords
- IT incidents
- risk prediction
- dataset imbalance
- IT service management (ITSM)
- Information Technology Infrastructure Library (ITIL)
- artificial intelligence for IT operations (AIOps)