CLIP-driven fine-grained mining for text-based person search

Research output: Contribution to journal › Article › peer-review

Abstract

Text-based Person Search aims to retrieve images of pedestrians based on textual descriptions. The primary challenge lies in exploring intra-modal discrepancies while ensuring inter-modal consistency. Recently, Contrastive Language-Image Pre-training (CLIP) has attracted extensive attention for its powerful semantic understanding ability. However, existing CLIP-based approaches mainly rely on masked modeling tasks, which learn fine-grained relationships only implicitly. In contrast, we propose a CLIP-Driven Fine-grained Mining (CDFM) framework that explicitly mines fine-grained relationships without introducing external tools. CDFM extracts well-aligned fine-grained representations while maintaining global alignment. Specifically, we first design an Attention Bias-based visual Forward (ABF) process that extracts local visual representations through unidirectional information transfer of local patch sets. Second, to obtain fine-grained text representations, we propose a Fine-grained Embedding Learning (FEL) module and a Text Extraction Strategy (TES). FEL uses a cross-attention mechanism to aggregate fine-grained text information into a set of learnable labels. To ensure that the fine-grained text representations are well aligned, TES uses the local visual representations as supervision, transferring cosine similarity and Euclidean distance knowledge from the pre-training space into the multimodal decoder. Experimental results demonstrate the superiority of our CDFM.
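
The two text-side ideas named in the abstract, cross-attention pooling into a small set of learnable vectors (FEL) and supervision through cosine similarity and Euclidean distance (TES), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vectors, the omission of learned query/key/value projections, and the unweighted sum of the two distance terms are all assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention_pool(queries, tokens):
    """Each query attends over all tokens and returns their weighted average.

    Simplification (assumption): no learned projections; queries and tokens
    are compared directly with scaled dot-product attention.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) * scale for t in tokens])
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def alignment_loss(text_feats, visual_feats):
    """Mean of (1 - cosine similarity) + Euclidean distance over paired features.

    The equal weighting of the two terms is a hypothetical choice.
    """
    terms = [
        (1.0 - cosine_similarity(t, v)) + euclidean_distance(t, v)
        for t, v in zip(text_feats, visual_feats)
    ]
    return sum(terms) / len(terms)


# Toy data: two "learnable" queries, three text token embeddings,
# and two local visual features acting as supervision targets.
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
visual = [[1.5, 0.5], [0.5, 1.5]]

pooled = cross_attention_pool(queries, tokens)
loss = alignment_loss(pooled, visual)
print(pooled, loss)
```

In a trained model the queries would be parameters updated by gradient descent, so that minimizing a loss of this kind pulls each pooled text vector toward its paired local visual feature.
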
Original language: English
Article number: 104741
Pages (from-to): 1-12
Number of pages: 12
Journal: Computer Vision and Image Understanding
Volume: 267
Early online date: 18 Mar 2026
Publication status: Published (in print/issue) - 30 Apr 2026

Bibliographical note

Publisher Copyright:
Copyright © 2026. Published by Elsevier Inc.

Data Availability Statement

Data will be made available on request.

Funding

This work was supported by a scholarship from Jiangsu University.

Funders
Jiangsu University

Keywords

• Cross-modality
• Fine-grained representation learning
• Text-based person search
