News

DCASE Workshop and Freesound Day

The Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop takes place at the Music Technology Group (MTG), Universitat Pompeu Fabra, Barcelona, Spain, October 29–31, 2025 (including BioDCASE, which focuses on bioacoustics, on October 29). DCASE is an annual challenge and international workshop on computational audio scene analysis and environmental audio AI. Prof. Shuo Zhang participates in DCASE 2025 in Barcelona, where he serves as co-chair of industrial liaisons; he is also a visiting researcher at the MTG from October 20 to October 31.

The Freesound 20th anniversary celebration takes place the same week, on October 28, 2025. Freesound.org has been a premier platform for sharing crowd-sourced audio recordings and sound effects since 2005. In recent years it has become an indispensable resource for numerous audio datasets in deep learning and audio AI, including the FSD50K dataset. The Freesound Day will include talks by the Freesound team and members of the community: participants will share their personal and professional experiences with the platform and highlight various projects that have emerged around it. There is also a composition competition using sounds from Freesound.

PhD thesis defense at MTG, UPF

Benno Weck (Music Technology Group, Universitat Pompeu Fabra) defends his doctoral thesis for the PhD in Information and Communication Technologies on October 27, 2025, in Barcelona, Spain. Prof. Shuo Zhang serves on the thesis committee and attends in person. The thesis, entitled “Content-based retrieval in large-scale audio collections with natural language as the interface,” is a body of research centered on the intersection of natural language processing and audio/music AI. The thesis is supervised by Prof. Xavier Serra of MTG, UPF.

Abstract: Audio collections, ranging from music archives to environmental sound libraries, have been growing quickly. However, these vast resources remain largely underutilised due to sparse metadata and limited search capabilities. This thesis investigates content-based retrieval in large-scale audio collections using natural language as the interface, with the goal of enabling more intuitive and expressive access to audio content. We address three central challenges: system design, data availability, and evaluation. For system design, we explore two primary directions. First, in audio captioning, we compare combinations of pretrained word embedding and machine listening models within a Transformer-based architecture. Second, in language-based retrieval, we investigate fine-tuning strategies for pretrained encoder models in a bi-encoder setup, considering different loss functions and the effects of augmenting training data with noisy audio-text pairs. To address the scarcity of paired text-music data, we introduce two novel datasets: Song Describer, a crowd-sourced collection of music captions, and WikiMuTe, which pairs music audio with encyclopedic textual descriptions. These datasets provide new resources for both evaluating and training multimodal models. In our evaluation work, we identify data leakage issues in an existing benchmark and propose more realistic dataset splits. We also introduce MuChoMusic, a multiple-choice question-answering benchmark designed to assess music understanding in multimodal models. Additionally, a user study explores how system constraints shape natural language query behaviour, revealing a tendency toward short queries despite a willingness to provide more detailed input. Together, these contributions aim to advance the integration of natural language and audio understanding and lay the foundations for richer interaction with audio content.
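
As a rough illustration of the bi-encoder retrieval setup described in the abstract: an audio encoder and a text encoder each map their input into a shared embedding space, and the pair is fine-tuned so that matching audio-text pairs score higher than mismatched ones. The minimal PyTorch sketch below shows one common family of loss functions for such setups, a symmetric contrastive (InfoNCE-style) loss. This is an illustration only, not the thesis’s actual implementation; the encoder outputs, batch size, embedding dimension, and temperature value are all assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_retrieval_loss(audio_emb, text_emb, temperature=0.07):
        # Illustrative symmetric InfoNCE-style loss over a batch of
        # paired audio/text embeddings (hypothetical, for exposition).
        # L2-normalize so the dot product equals cosine similarity.
        audio_emb = F.normalize(audio_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # Pairwise similarities: logits[i, j] = sim(audio_i, text_j).
        logits = audio_emb @ text_emb.T / temperature

        # Matching pairs sit on the diagonal of the similarity matrix.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Contrast in both directions: audio-to-text and text-to-audio.
        loss_a2t = F.cross_entropy(logits, targets)
        loss_t2a = F.cross_entropy(logits.T, targets)
        return (loss_a2t + loss_t2a) / 2

    # Toy usage: random tensors stand in for the outputs of pretrained
    # audio and text encoders (batch size and dimension are made up).
    audio_emb = torch.randn(8, 512)
    text_emb = torch.randn(8, 512)
    print(contrastive_retrieval_loss(audio_emb, text_emb))

At retrieval time, the same similarity matrix (without the loss) can rank candidate audio clips against a natural-language query.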

Visionular CEO Zoe Liu gives Tufts ECE talk

Dr. Zoe Liu, CEO of Visionular

On September 26, 2025, Prof. Shuo Zhang invited Visionular CEO Dr. Zoe Liu to give a talk at the Tufts ECE Seminar (10:30 a.m. to noon, JCC 170), hosted by Prof. Yingjie Lao and ECE PhD student and AIDA collaborator Rui Chu. More than 30 students and faculty from ECE, DA, and other programs attended the lecture, which was preceded and followed by a faculty meeting and lunch with Rui Chu, Prof. Shuchin Aeron, Prof. Shuo Zhang, and Prof. Peter Lu.

New publication on trustworthy AI

In today’s world, we are surrounded by a variety of Autonomous AUdio Systems (AAUS): the wide breadth of devices and systems enabled for audio and speech capture, such as home devices, smart watches, virtual assistants, and audio-enabled vehicles. Captured audio may contain personal identity attributes of voice and speech that are protected within the scope of individual rights. This data can also be misused by attackers, highlighting key issues of data protection, privacy, and security.

Prof. Shuo Zhang collaborated with the team of Prof. Jennifer Williams (University of Southampton) on the project “Co-Design of Context-Aware Audio Capture” (link1, link2), part of the Trustworthy Autonomous Systems (TAS) Hub under UKRI. In this project, the team co-designed, implemented, and administered an interactive listening survey in which participants choose how they would modify audio in various contexts and imagined scenarios involving trust. From our analysis, we investigate ways to strengthen individual protections and increase trust in AAUS.

A new publication this month in the journal Computer Speech and Language, entitled “Public perceptions of speech technology trust in the United Kingdom,” summarizes our findings. It is an extended version of our previous publication “Socio-Technical Trust For Multi-Modal Hearing Assistive Technology,” a blue-sky paper published at the AMHAT workshop of the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) in Greece.

Tufts article on AIDA alum Dharva Khambholia

A recent Tufts article by Masie O’Brian, “Using AI to detect AI,” reported on the collaboration between Prof. Shuo Zhang (AIDA) at Tufts and Prof. Jennifer Williams of the University of Southampton, UK. The collaboration was part of a project funded by Responsible AI UK (RAI UK) to develop responsible and trustworthy AI for safety-critical communication systems, co-led by three investigators from the UK (Dr. Jennifer Williams), the US (Dr. Peng Wei, George Washington University), and Australia (Dr. Zena Assaad, Australian National University). Tufts students and AIDA lab research assistants Dharva Khambholia and Zhou Zhou contributed to the research by creating a synthetic speech dataset targeted at anti-spoofing, advised by Prof. Zhang and Prof. Williams. The project culminated in three international workshops, held in Washington DC, Sydney, and London, where researchers and government policymakers came together to discuss responsible AI from technical and policy perspectives. Prof. Shuo Zhang participated in the RAI UK workshop (London) and a Southampton seminar in 2025; Dharva and Prof. Zhang also participated in the Washington DC workshop in 2024.