Benno Weck (Music Technology Group, Universitat Pompeu Fabra) will defend his doctoral thesis for the PhD in Information and Communication Technologies on Oct 27, 2025, in Barcelona, Spain. Prof. Shuo Zhang is a member of the thesis committee and will attend the defense. The thesis, entitled “Content-based retrieval in large-scale audio collections with natural language as the interface”, presents a body of research centered on the intersection of natural language processing and audio/music AI. It is supervised by Prof. Xavier Serra of the MTG, UPF.
Abstract: Audio collections, ranging from music archives to environmental sound libraries, have been growing quickly. However, these vast resources remain largely underutilised due to sparse metadata and limited search capabilities. This thesis investigates content-based retrieval in large-scale audio collections using natural language as the interface, with the goal of enabling more intuitive and expressive access to audio content. We address three central challenges: system design, data availability, and evaluation. For system design, we explore two primary directions. First, in audio captioning, we compare combinations of pretrained word embedding and machine listening models within a Transformer-based architecture. Second, in language-based retrieval, we investigate fine-tuning strategies for pretrained encoder models in a bi-encoder setup, considering different loss functions and the effects of augmenting training data with noisy audio-text pairs. To address the scarcity of paired text-music data, we introduce two novel datasets: Song Describer, a crowd-sourced collection of music captions, and WikiMuTe, which pairs music audio with encyclopedic textual descriptions. These datasets provide new resources for both evaluating and training multimodal models. In our evaluation work, we identify data leakage issues in an existing benchmark and propose more realistic dataset splits. We also introduce MuChoMusic, a multiple-choice question-answering benchmark designed to assess music understanding in multimodal models. Additionally, a user study explores how system constraints shape natural language query behaviour, revealing a tendency toward short queries despite a willingness to provide more detailed input. Together, these contributions aim to advance the integration of natural language and audio understanding and lay the foundations for richer interaction with audio content.
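For readers unfamiliar with the bi-encoder retrieval setup mentioned in the abstract, the sketch below illustrates the general idea: audio and text are projected into a shared embedding space and trained with a contrastive objective so that matching audio-text pairs rank highest at query time. This is a minimal, hypothetical example, not the thesis's implementation; the encoder dimensions, projection heads, and InfoNCE-style loss are assumptions made purely for illustration.

```python
# Illustrative sketch of a bi-encoder for language-based audio retrieval.
# All names, dimensions, and the loss choice are assumptions for illustration;
# they are not taken from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiEncoder(nn.Module):
    """Projects audio and text embeddings into a shared retrieval space."""

    def __init__(self, audio_dim: int, text_dim: int, shared_dim: int = 256):
        super().__init__()
        # In practice these projection heads would sit on top of pretrained
        # audio and text encoders (e.g. a machine-listening model and a
        # language model); here we only model the projection step.
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor):
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        return a, t


def contrastive_loss(a: torch.Tensor, t: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss over in-batch audio-text pairs."""
    logits = a @ t.T / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


if __name__ == "__main__":
    # Dummy batch of precomputed encoder outputs (random stand-ins).
    audio_emb = torch.randn(8, 512)   # e.g. pooled audio-encoder features
    text_emb = torch.randn(8, 768)    # e.g. pooled text-encoder features

    model = BiEncoder(audio_dim=512, text_dim=768)
    a, t = model(audio_emb, text_emb)
    loss = contrastive_loss(a, t)
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

At retrieval time, a text query is encoded once and compared against precomputed audio embeddings by cosine similarity, which is what makes this setup practical for large-scale collections.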