Transforming IT operations: harnessing natural language processing and transformers in AIOps

  • Salman Ahmed

Student thesis: Doctoral Thesis

Abstract

Large-scale companies across various industries maintain extensive IT infrastructures to support their operations and provide high-quality service to both customers and employees. These IT operations are managed by specialized teams who handle incident reports, which may be generated automatically through autonomous systems or manually. Early detection of major IT incidents is crucial as it significantly reduces business operation disruptions and prevents severe outcomes like complete system shutdowns.

In this thesis, I present an empirical analysis of eleven state-of-the-art models designed to predict the severity of these incidents. I utilized an industry-led use case consisting of 500,000 records gathered over a one-year period from three distinct stakeholders: Agency, Customer, and Employee. Additionally, I reviewed the literature on IT incident tickets, particularly focusing on risk prediction.

This thesis investigates the effectiveness of advanced transformer models, including BERT, ERNIE, and RoBERTa, in overcoming challenges in IT incident data. These challenges include overfitting due to limited datasets and bias towards the majority class (minor incidents) in imbalanced data. Transformers excel at addressing both issues. Their unique ability to learn complex relationships within data allows them to handle larger datasets effectively. Additionally, their sophisticated text processing capabilities help mitigate bias by analyzing the context within incident reports. This contextual understanding enables transformers to identify major incidents more accurately, even when they are less frequent.
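
As a small illustration of the class-imbalance problem described above, the sketch below computes inverse-frequency class weights, a common heuristic for making a classifier's loss pay more attention to the rare class. The abstract does not state that the thesis uses this technique; the function name, the label names, and the 95/5 split are hypothetical and chosen only for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {label: weight}, where weight = n_samples / (n_classes * count).

    Rare classes receive larger weights, so errors on them cost more
    when the weights are applied in a weighted loss function.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical toy data (not from the thesis): 95 minor and 5 major incidents.
labels = ["minor"] * 95 + ["major"] * 5
weights = inverse_frequency_weights(labels)
```

With this toy split, the weight for the "major" class is 19 times the weight for the "minor" class, which is the ratio of their frequencies.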

By leveraging these capabilities, I aim to significantly improve the prediction accuracy for major IT incidents. The key contributions of this thesis are:

• Literature Review and Gap Analysis:
– I conducted a systematic review of existing studies on IT incident risk prediction and identified a gap in comprehensive and systematic research.
– This thesis consolidates scattered knowledge in the domain, outlines the challenges, and discusses underexplored areas for further investigation.
– I assessed advanced tools and techniques for managing unstructured text data in IT incident records, highlighting the complexities and uncertainties involved.

• Predictive Alert System Development:
– In collaboration with an industrial partner, I mapped the provided data to develop a proactive predictive alert system.
– This system identifies the risk of major incidents early, enabling organizations to mitigate risks effectively.
– The framework not only minimizes the impact on IT operations and customer experiences but also ensures timely project delivery, enhances business reliability, optimizes organizational effectiveness, and reduces service costs.

• Severity Classification Framework:
– I introduced a framework that employs cutting-edge algorithms to classify and predict incident severity levels (high, medium, and low).
– This framework is expected to improve the speed and accuracy of IT incident management, facilitating more efficient and effective handling of incidents.

• Knowledge-Based System for ITSM:
– I developed a comprehensive system that automates two critical aspects of IT Service Management (ITSM): Ticket Assignment Group (TAG) and Incident Resolution (IR).
– This system streamlines the traditional ITSM process by bypassing steps such as data investigation, event correlation, situation room collaboration, and root cause analysis.
– The immediate automated solutions provided by this system not only conserve key performance indicator (KPI) resources but also significantly reduce the Mean Time to Resolution (MTTR).
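
For readers unfamiliar with the metric, MTTR is the average time between a ticket being opened and being resolved. The sketch below shows that calculation on made-up timestamps; the function name and the ticket data are hypothetical and are not taken from the thesis or its dataset.

```python
from datetime import datetime


def mean_time_to_resolution_hours(tickets):
    """Average (resolved - opened) duration in hours over all tickets.

    `tickets` is a list of (opened, resolved) datetime pairs.
    """
    if not tickets:
        raise ValueError("at least one ticket is required")
    durations = [
        (resolved - opened).total_seconds() / 3600
        for opened, resolved in tickets
    ]
    return sum(durations) / len(durations)


# Hypothetical tickets resolved in 2, 4, and 6 hours respectively.
tickets = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 18, 0)),
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 14, 0)),
]
mttr = mean_time_to_resolution_hours(tickets)
```

Here `mttr` is the mean of the three durations in hours; any automation that shortens individual resolution times lowers this average.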

Through these innovations, my thesis contributes significantly to the field of ITSM by leveraging state-of-the-art AI technologies to enhance the prediction, classification, and management of IT incidents. This approach strengthens incident handling protocols and supports the overall strategic objectives of IT operations in large-scale organizations.

Date of Award: Jul 2024
Original language: English
Supervisor: Muskaan Singh (Supervisor), Damien Coyle (Supervisor) & Magda Bucholc (Supervisor)

Keywords

  • NLP
  • AI
  • AIOPS
  • DevOps
  • BERT
  • ERNIE
  • RoBERTa
  • transformers
  • LLMs
  • ITSM
  • IT service management
  • ServiceNow
  • IT ticket risk prediction
