Abstract
In recent years, Video Anomaly Detection (VAD) has shifted from conventional appearance-based modeling to semantically driven frameworks powered by large language models (LLMs). Traditional reconstruction- and prediction-based methods, which rely on motion or appearance patterns learned from normal data, often misclassify previously unseen yet semantically normal events as anomalies. To address this limitation, we propose SOR-BDNet (Semantic-Optical Representation with Boundary Detection Network), an annotation-free multimodal VAD framework that jointly leverages visual appearance and motion dynamics to generate interpretable semantic representations at the frame level. Specifically, we employ RAFT to estimate dense motion fields and concatenate the resulting flow maps with RGB images to form unified spatiotemporal inputs. These fused representations are fed into a GPT-4o-based module that generates semantic captions capturing object semantics and motion cues. Anomalies are detected by measuring semantic deviations from a memory bank constructed from normal captions. To further refine temporal boundaries, we design a boundary refinement module that integrates visual continuity constraints with contrastive feature learning based on a Swin Transformer backbone. Extensive experiments on four challenging benchmarks (UCSD-Ped2, Avenue, ShanghaiTech, and UCF-Crime) demonstrate that SOR-BDNet achieves frame-level accuracies of 97.96%, 82.86%, 87.36%, and 85.64%, respectively. These results highlight the robustness and scalability of the proposed framework and point to significantly improved interpretability and generalization across diverse real-world surveillance scenarios. The source code and pretrained models are available at https://github.com/syi-coder/SOR-BDNet-Semantic-Optical-Representation-for-Boundary-Aware-Video-Anomaly-Detection-with-GPT-4o.
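The memory-bank step described in the abstract (scoring a frame by how far its caption deviates from captions of normal footage) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for whatever text encoder the paper uses, the "one minus maximum cosine similarity" score is an assumed deviation measure, and the captions are invented examples.

```python
import math
from collections import Counter


def embed(caption: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real text encoder)."""
    return Counter(caption.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def anomaly_score(caption: str, memory_bank: list[Counter]) -> float:
    """Assumed score: 1 minus the best similarity to any normal caption in the bank."""
    query = embed(caption)
    best = max((cosine(query, m) for m in memory_bank), default=0.0)
    return 1.0 - best


# Hypothetical captions of normal frames, used to build the memory bank.
normal_captions = [
    "a person walks along the path",
    "two people walk slowly on the sidewalk",
]
memory_bank = [embed(c) for c in normal_captions]

# A caption already in the bank scores near zero; an unfamiliar one scores higher.
score_normal = anomaly_score("a person walks along the path", memory_bank)
score_odd = anomaly_score("a cyclist rides fast through the crowd", memory_bank)
```

In a real system the toy embedding would be replaced by a sentence encoder, and the per-frame scores would be thresholded or smoothed over time before any boundary refinement.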
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 1-23 |
| Number of pages | 23 |
| Journal | ACM Transactions on Multimedia Computing, Communications, and Applications |
| Volume | 22 |
| Issue number | 5 |
| Early online date | 13 Mar 2026 |
| DOIs | |
| Publication status | Published online - 13 Mar 2026 |
Bibliographical note
© 2026 Copyright held by the owner/author(s).
Funding
The authors gratefully acknowledge the financial support provided by the British Telecom Ireland Innovation Centre (BTIIC), funded by Invest Northern Ireland (Invest NI) and BT. This work was also supported in part by the Major Fundamental Research Project of Shandong Province under Grant No. ZR2024ZD03.
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 16 Peace, Justice and Strong Institutions
Keywords
- Video Anomaly Detection
- Large Language Models
- Boundary Detection Network
- Optical Flow
Fingerprint
Dive into the research topics of 'SOR-BDNet: Semantic-Optical Representation for Boundary-Aware Video Anomaly Detection with GPT-4o'. Together they form a unique fingerprint.