Abstract
The task of visuo-tactile object recognition is key to enabling robots to interact with humans and their environment efficiently and effectively. The differing statistical properties of visual images and tactile time-series data make visuo-tactile fusion non-trivial. This work investigates the use of Transformers to perform feature-level fusion of visuo-tactile data, exploiting the Transformer's self-attention structure to model temporal relationships between the visual and tactile data. The proposed pipeline is tested on the PHAC-2 dataset, and a comprehensive ablation study is conducted across a collection of leading activation functions. The pipeline achieves state-of-the-art accuracy for visuo-tactile object recognition on the PHAC-2 dataset, reaching 94.3% accuracy when data from two tactile actions are considered.
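The paper's exact architecture is not reproduced here. The sketch below is a minimal illustration, under stated assumptions, of how Transformer self-attention can fuse visual and tactile inputs at the feature level: both modalities are projected into a shared token space and passed through one encoder so that attention can relate visual features to tactile time steps. The class name `VisuoTactileFusion`, the dimensions, and the hyperparameters are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): Transformer-based feature-level
# fusion of visual and tactile data. Assumes pre-extracted visual feature
# vectors and raw tactile time-series windows; all sizes are placeholders.
import torch
import torch.nn as nn


class VisuoTactileFusion(nn.Module):
    def __init__(self, visual_dim=512, tactile_dim=19, d_model=128,
                 n_heads=4, n_layers=2, n_classes=10):
        # n_classes is a placeholder; set it to the number of object classes.
        super().__init__()
        # Project each modality into a shared token space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.tactile_proj = nn.Linear(tactile_dim, d_model)
        # Learned modality embeddings distinguish visual from tactile tokens.
        self.modality_emb = nn.Embedding(2, d_model)
        # Classification token, in the style of BERT/ViT encoders.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, activation="gelu")
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, visual_feats, tactile_seq):
        # visual_feats: (B, Nv, visual_dim) -- e.g. patch or region features
        # tactile_seq:  (B, Nt, tactile_dim) -- tactile readings over time
        v = self.visual_proj(visual_feats)
        t = self.tactile_proj(tactile_seq)
        v = v + self.modality_emb(torch.zeros(v.shape[:2], dtype=torch.long,
                                              device=v.device))
        t = t + self.modality_emb(torch.ones(t.shape[:2], dtype=torch.long,
                                             device=t.device))
        cls = self.cls_token.expand(v.size(0), -1, -1)
        # Self-attention over the joint token sequence mixes the two modalities.
        tokens = torch.cat([cls, v, t], dim=1)
        encoded = self.encoder(tokens)
        # Classify from the [CLS] token.
        return self.classifier(encoded[:, 0])


# Usage sketch with dummy tensors.
model = VisuoTactileFusion()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 100, 19))
print(logits.shape)  # torch.Size([2, 10])
```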
Original language | English |
---|---|
Pages | 1-8 |
Number of pages | 8 |
DOIs | |
Publication status | Published online - 9 Sept 2024 |
Event | IEEE World Congress on Computational Intelligence, Japan. Duration: 30 Jun 2024 → 5 Jul 2024 |
Conference
Conference | IEEE World Congress on Computational Intelligence |
---|---|
Abbreviated title | WCCI |
Period | 30/06/24 → 5/07/24 |
Data Access Statement
No information found.
Keywords
- Object Recognition
- Multimodal Fusion
- Transformers
- Visuo-tactile Data