Abstract
Text-based Person Search aims to retrieve images of pedestrians based on textual descriptions. The primary challenge lies in exploring intra-modal discrepancies while ensuring inter-modal consistency. Recently, Contrastive Language-Image Pre-training (CLIP) has attracted extensive attention for its powerful semantic understanding ability. However, existing CLIP-based approaches mainly rely on masked modeling tasks to learn fine-grained relationships implicitly. In contrast, we propose a CLIP-Driven Fine-grained Mining (CDFM) framework that explicitly mines fine-grained relationships without introducing external tools. CDFM extracts well-aligned fine-grained representations while maintaining global alignment. Specifically, we first design an Attention Bias-based visual Forward (ABF) process that extracts local visual representations through unidirectional information transfer of local patch sets. Second, to obtain fine-grained text representations, we propose a Fine-grained Embedding Learning (FEL) module and a Text Extraction Strategy (TES). FEL uses a cross-attention mechanism to aggregate fine-grained text information into a set of learnable labels. To ensure that the fine-grained text representations are well aligned, TES uses the local visual representations as supervision, transferring cosine similarity and Euclidean distance knowledge from the pre-training space into the multimodal decoder. Experimental results demonstrate the superiority of CDFM.
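The abstract names two attention-based mechanisms without giving their details: an attention bias that makes information flow one way (ABF), and cross-attention that pools text tokens into learnable labels (FEL). The NumPy sketch below illustrates only the general techniques, not the authors' implementation. The function names, the token layout `[patches, local tokens]`, the choice to block patch rows from attending to local-token columns, the dimensions, and the random stand-ins for learned parameters are all assumptions made for illustration; learned projections and multiple heads are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, bias=None):
    # Scaled dot-product attention with an optional additive bias on the logits.
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if bias is not None:
        logits = logits + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

def unidirectional_bias(num_patches, num_local):
    # Hypothetical bias for a sequence laid out as [patches, local tokens].
    # Patch rows are blocked from attending to local-token columns, so
    # information flows from patches to local tokens but not back.
    n = num_patches + num_local
    bias = np.zeros((n, n))
    bias[:num_patches, num_patches:] = -np.inf
    return bias

rng = np.random.default_rng(0)
dim = 8

# ABF-style step (assumed layout): 6 patch tokens followed by 2 local tokens.
patches = rng.normal(size=(6, dim))
local_tokens = rng.normal(size=(2, dim))
seq = np.concatenate([patches, local_tokens])
bias = unidirectional_bias(num_patches=6, num_local=2)
visual_out, visual_w = attention(seq, seq, seq, bias)

# FEL-style step (assumed shapes): 4 query vectors stand in for the
# "learnable labels" and aggregate information from 10 text tokens.
text_tokens = rng.normal(size=(10, dim))
label_queries = rng.normal(size=(4, dim))  # would be trained in practice
fine_text, text_w = attention(label_queries, text_tokens, text_tokens)
```

In this sketch the patch tokens are unaffected by the local tokens, so the global representation would be computed exactly as without them, while each label query returns one pooled text vector.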
| Original language | English |
|---|---|
| Article number | 104741 |
| Pages (from-to) | 1-12 |
| Number of pages | 12 |
| Journal | Computer Vision and Image Understanding |
| Volume | 267 |
| Early online date | 18 Mar 2026 |
| Publication status | Published (in print/issue) - 30 Apr 2026 |
Bibliographical note
Publisher Copyright: Copyright © 2026. Published by Elsevier Inc.
Data Availability Statement
Data will be made available on request.
Funding
This work was supported by a scholarship from Jiangsu University.
| Funders |
|---|
| Jiangsu University |
Keywords
- Cross-modality
- Fine-grained representation learning
- Text-based person search
Fingerprint
Dive into the research topics of 'CLIP-driven fine-grained mining for text-based person search'. Together they form a unique fingerprint.