A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION

Michel Olvera; Paraskevas Stamatiadis; Slim Essid

Conference paper, Year: 2024


Michel Olvera

Role: Author
PersonId: 1416347
IdHAL: michel-olvera

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Paraskevas Stamatiadis

Role: Author
PersonId: 1416367
IdHAL: paraskevas-stamatiadis

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Slim Essid

Role: Author
PersonId: 181234
IdHAL: slimessid
ORCID: 0000-0002-0028-327X
IdRef: 11025130X

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Abstract

Audio-text models trained via contrastive learning offer a practical approach to audio classification through natural language prompts, such as "this is a sound of" followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification and demonstrate that higher-performing options exist. First, we find that prompt formatting significantly affects performance: simply prompting the models with properly formatted class labels performs competitively with optimized prompt templates and even prompt ensembling. Moreover, we look into complementing class labels with audio-centric descriptions. By leveraging large language models, we generate textual descriptions that prioritize the acoustic features of sound events to disambiguate between classes, without extensive prompt engineering. We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets. Remarkably, this method requires no additional training and remains fully zero-shot.
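
As a rough illustration of the prompting setup the abstract describes, the Python sketch below builds a prompt from a class label, optionally under a template or with an appended description, and picks the class whose prompt embedding is most similar to the audio embedding. Everything named here is hypothetical: `build_prompt`, `zero_shot_classify`, the capitalize-and-add-a-period formatting rule, the example description, and `toy_embed_text`, a keyword-count stand-in for a real contrastive text encoder. None of it is taken from the paper's implementation.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def build_prompt(label: str, template: str = "{label}", description: str = "") -> str:
    """Build a text prompt for one class.

    Assumption: "properly formatted" is read here as lower-casing the label,
    replacing underscores, capitalizing the sentence and ending it with a
    period. The abstract does not spell out the exact rule.
    """
    name = label.replace("_", " ").strip().lower()
    sentence = template.format(label=name).strip()
    sentence = sentence[:1].upper() + sentence[1:]
    if not sentence.endswith("."):
        sentence += "."
    # An optional audio-centric description is appended after the label.
    return f"{sentence} {description}".strip()


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_classify(
    audio_embedding: Sequence[float],
    prompts: Sequence[str],
    embed_text: Callable[[str], Vector],
) -> int:
    """Return the index of the prompt most similar to the audio embedding."""
    scores = [cosine(audio_embedding, embed_text(p)) for p in prompts]
    return max(range(len(prompts)), key=scores.__getitem__)


def toy_embed_text(text: str) -> Vector:
    """Hypothetical stand-in for a text encoder: counts two keywords."""
    words = text.lower().replace(".", " ").split()
    return [float(words.count("dog")), float(words.count("siren"))]


if __name__ == "__main__":
    labels = ["dog_bark", "siren"]
    prompts = [build_prompt(label) for label in labels]
    # Made-up audio embedding living in the same toy 2-D space.
    audio_embedding = [1.0, 0.1]
    predicted = zero_shot_classify(audio_embedding, prompts, toy_embed_text)
    print(prompts, "->", labels[predicted])
```

In practice, `toy_embed_text` and the hand-made audio embedding would be replaced by the text and audio encoders of a pretrained audio-text model, and the description argument by text generated with a large language model.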

Keywords

Zero-shot audio classification; audio-text models; contrastive language-audio pretraining; in-context learning

Domains

Artificial Intelligence [cs.AI]

Main file

main.pdf (103.3 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-04701759

Submitted on: Wednesday, September 18, 2024, 17:32:31

Last modified on: Wednesday, October 23, 2024, 10:30:04

Dates and versions

hal-04701759, version 1 (18-09-2024)

Identifiers

HAL Id: hal-04701759, version 1

Cite

Michel Olvera, Paraskevas Stamatiadis, Slim Essid. A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION. DCASE 2024 - 9th Workshop on Detection and Classification of Acoustic Scenes and Events, Oct 2024, Tokyo, Japan. ⟨hal-04701759⟩

Collections

LTCI IDS S2A IP_PARIS INSTITUT-MINES-TELECOM

382 views

35 downloads
