Workshop location

Date & Time: April 7, 2:00 – 5:30 pm IST

Venue: Hyderabad International Convention Centre (HICC)

Room: MR1.04

Program overview

Welcome Address: 15 mins (2:00 – 2:15 pm)

Invited Talk by Oriol Nieto on Sound Design with GenAI – 30 mins (2:15 – 2:45 pm) [slides]

Invited Talk by Zhuo Chen on Advancing Speech and Audio Generation – 30 mins (2:45 – 3:15 pm) [slides]

Invited Talk by Bhuvana Ramabhadran on Multilingual Speech Representations – 30 mins (3:15 – 3:45 pm)

Tea Break: 15 mins (3:45 – 4:00 pm)

Poster Session: 90 mins (4:00 – 5:30 pm)

Invited talks

Sound Design with GenAI [slides]

Time: 2:15 – 2:45 pm

Speaker: Oriol Nieto (Adobe Research)

Abstract

This presentation explores the forefront of generative AI research for sound design at Adobe Research. I will provide an overview of Latent Diffusion Models, which form the foundation of our work, and introduce several recent advancements focused on controllability and multimodality. I will begin with SILA [1], a technique designed to enhance the control of sound effects generated through text prompts. Following this, I will present Sketch2Sound [2], a model that generates sound effects conditioned on both audio recordings and text. Lastly, I will examine MultiFoley [3], a model capable of generating sound effects from both silent videos and text. Throughout the talk, I will showcase a series of examples and demos to illustrate the practical applications and potential of these models, making the case that we are only beginning to unveil a completely new paradigm in how to approach sound design.
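
As a rough illustration of the mechanism behind the latent diffusion models mentioned above, the sketch below shows a minimal text-conditioned DDPM-style sampling loop: a denoiser predicts the noise in a latent at each timestep, and the latent is iteratively cleaned starting from pure Gaussian noise. The toy denoiser, shapes, and noise schedule are placeholder assumptions for exposition; this is not the SILA, Sketch2Sound, or MultiFoley implementation.

```python
# Minimal sketch of text-conditioned latent diffusion sampling (DDPM-style).
# All module names, shapes, and schedules are hypothetical stand-ins, not the
# actual Adobe Research models discussed in the talk.
import torch

class ToyDenoiser(torch.nn.Module):
    """Predicts the noise added to a latent, conditioned on a text embedding."""
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + text_dim + 1, 256),
            torch.nn.SiLU(),
            torch.nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, text_emb):
        t_feat = t.float().unsqueeze(-1) / 1000.0          # normalized timestep
        return self.net(torch.cat([z_t, text_emb, t_feat], dim=-1))

@torch.no_grad()
def sample_latent(denoiser, text_emb, steps=50, latent_dim=64):
    """Reverse diffusion: start from Gaussian noise and iteratively denoise."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(text_emb.shape[0], latent_dim)          # z_T ~ N(0, I)
    for t in reversed(range(steps)):
        t_batch = torch.full((z.shape[0],), t, dtype=torch.long)
        eps_hat = denoiser(z, t_batch, text_emb)             # predicted noise
        # Posterior mean of z_{t-1} given z_t (standard DDPM update).
        z = (z - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z  # in a full system, decoded to a waveform by a latent autoencoder

# Usage: a random vector stands in for a real text encoder's output.
latents = sample_latent(ToyDenoiser(), text_emb=torch.randn(2, 32))
print(latents.shape)  # torch.Size([2, 64])
```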

Bio

Oriol is a Senior Audio Research Engineer at Adobe Research, where he focuses on human-centered AI for audio creativity, encompassing everything from music to audiobooks, video editing, and sound design. He holds a PhD in Music Technology from New York University, a Master's in Music, Science, and Technology from Stanford University, and a Master's in Information Technologies from Pompeu Fabra University. Highly involved with the ISMIR community, he was one of the three General Chairs for ISMIR 2024 in San Francisco this past November. Oriol has helped develop relevant open-source MIR packages such as MSAF, jams, mir-eval, and librosa; contributed to PyTorch; and plays guitar, violin, cajón, and sings (and screams) in his spare time.

Advancing Speech and Audio Generation [slides]

Time: 2:45 – 3:15 pm

Speaker: Zhuo Chen (ByteDance)

Abstract

Speech and audio generation has witnessed unprecedented advancement in the past two years, transforming how we approach human-machine communication. Various methods have yielded remarkable results, blurring the boundaries between synthetic and natural voice experiences. In this talk, I will first introduce our findings in audio generation through the seed-audio series, demonstrating how these innovations have overcome previous technical barriers. Next, I'll discuss the inherent limitations of traditional modular speech processing in dialogue systems and why they fall short of truly natural interaction. Building on these insights, I'll introduce our recent end-to-end voice interaction model, highlighting discoveries that emerged during development. The presentation will conclude with a discussion of current challenges in developing human-like conversation systems.
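
As a hypothetical illustration of the modular-versus-end-to-end contrast raised above, the sketch below composes stub ASR, LLM, and TTS components into a cascaded agent and compares it with a single speech-token model. The stub classes and their behavior are assumptions made for exposition only and do not reflect ByteDance's systems.

```python
# Illustrative contrast between a modular (cascaded) dialogue pipeline and an
# end-to-end voice-interaction model. All components are toy stubs.

class StubASR:
    def transcribe(self, speech): return "hello there"            # speech -> text

class StubLLM:
    def generate(self, text): return f"reply to: {text}"          # text -> text

class StubTTS:
    def synthesize(self, text): return [0.0] * 16_000             # text -> waveform

def cascaded_respond(speech, asr=StubASR(), llm=StubLLM(), tts=StubTTS()):
    """Modular pipeline: ASR -> text LLM -> TTS. Prosody, emotion, and timing in
    the user's speech are discarded at the ASR/text boundary, and each stage
    adds its own latency and propagates its own errors."""
    return tts.synthesize(llm.generate(asr.transcribe(speech)))

class StubSpeechLM:
    """Single model over discrete audio tokens: speech tokens in, speech tokens
    out, so paralinguistic cues stay in-band and responses can be streamed."""
    def encode(self, speech): return [1, 2, 3]
    def generate(self, tokens): return [4, 5, 6]
    def decode(self, tokens): return [0.0] * 16_000

def end_to_end_respond(speech, model=StubSpeechLM()):
    return model.decode(model.generate(model.encode(speech)))

user_speech = [0.0] * 16_000  # one second of silence as a placeholder input
print(len(cascaded_respond(user_speech)), len(end_to_end_respond(user_speech)))
```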

Bio

Zhuo Chen is a Research Manager at ByteDance, where he leads a team specializing in voice interaction systems. He received his PhD from Columbia University in 2017 and subsequently served as a Principal Applied Scientist at Microsoft before joining ByteDance. Throughout his career, Dr. Chen has contributed significantly to the field of speech technology, with over 150 published research papers and patents. His work spans diverse speech processing domains, including speech recognition and translation, speech separation and enhancement, speaker identification and diarization, multi-channel processing and beamforming, speech self-supervised learning, and speech generation.

Multilingual Speech Representations

Time: 3:15 – 3:45 pm

Speaker: Bhuvana Ramabhadran (Google DeepMind)

Abstract

Machine learning continues to produce models that can scale and solve multilingual speech and language understanding tasks. Self-supervised learning, first introduced in the field of computer vision, refers to frameworks that learn labels or targets from the unlabeled input signal. In other words, self-supervised learning makes use of proxy supervised learning tasks, such as contrastive learning, to identify specific parts of the signal that carry information, thereby helping models learn robust representations. Recently, self-supervised (pre-training) approaches have gained popularity and become key to the representations learnt by foundation models capable of addressing several tasks in many languages. Multilinguality and code-switching, common in multilingual societies, pose several challenges for speech and language processing. This talk addresses the following questions: Is there a joint, robust latent representation of multiple modalities that can help multilingual speech and language understanding? Are there unsupervised techniques to address languages with scarce data resources? Can this type of cross-lingual transfer aid zero-shot learning with these representations?
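
As a small illustration of the contrastive proxy task described above, the sketch below implements an InfoNCE-style loss in which two views of the same utterance are pulled together while other utterances in the batch serve as negatives. The toy encoder, window size, and augmentation are illustrative assumptions, not the actual pre-training recipe behind the representations discussed in the talk.

```python
# Minimal sketch of a contrastive (InfoNCE-style) self-supervised objective for
# speech representations; purely illustrative, with a toy encoder and a crude
# noise augmentation standing in for masking or other real augmentations.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, temperature=0.1):
    """anchor, positive: (batch, dim) embeddings of two views of the same audio."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature       # (batch, batch) similarities
    targets = torch.arange(anchor.shape[0])          # i-th anchor matches i-th positive
    return F.cross_entropy(logits, targets)

# Toy usage: a shared encoder embeds raw frames and their augmented counterparts.
encoder = torch.nn.Sequential(
    torch.nn.Linear(400, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
)
frames = torch.randn(8, 400)                          # 8 utterance windows, 400 samples each
augmented = frames + 0.05 * torch.randn_like(frames)  # stand-in for masking/augmentation
loss = info_nce_loss(encoder(frames), encoder(augmented))
loss.backward()                                       # gradients flow into the shared encoder
print(float(loss))
```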

Bio

Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google DeepMind, focusing on semi-supervised learning for speech recognition and multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager at IBM Research AI, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM’s worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on the use of speech synthesis to improve core speech recognition performance and on self-supervised learning.

List of Accepted Papers

  1. Performance evaluation of SLAM-ASR: The Good, The Bad, The Ugly, and the Way Forward
    Authors: Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke
  2. StableTTS: Towards Efficient Denoising Acoustic Decoder for Text to Speech Synthesis with Consistency Flow Matching
    Authors: Zhiyong Chen, Xinnuo Li, Shuhang Wu, Zhi Yang, Zhiqi Ai, Shugong Xu
  3. USMID: A Unimodal Speaker-Level Membership Inference Detector for Contrastive Pretraining
    Authors: Ruoxi Cheng, Yizhong Ding, Cao Shuirong, Shitong Shao, Zhiqiang Wang
  4. MACE: Leveraging Audio for Evaluating Audio Captioning Systems
    Authors: Satvik Dixit, Soham Deshmukh, Bhiksha Raj
  5. Musimple: A Simplified Music Generation System With Diffusion Transformer
    Authors: Zheqi Dai, Haolin He, Qiuqiang Kong
  6. Discrete Speech Unit Extraction via Independent Component Analysis
    Authors: Tomohiko Nakamura, Kwanghee Choi, Keigo Hojo, Yoshiaki Bando, Satoru Fukayama, Shinji Watanabe
  7. PAWS: A Physical Acoustic Wave Simulation Dataset for Sound Modeling and Rendering
    Authors: Tianming Yin, Yiyang Zhou, Xuzhou Ye, Qiuqiang Kong
  8. Indics2ST: Indian Multilingual Translation Corpus For Evaluating Speech-Large Language Models
    Authors: Sanket Shah, Kavya Saxena, Kancharana Manideep Bharadwaj, Sharath Adavanne, Nagaraj Adiga
  9. Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
    Authors: Han Yin, Yang Xiao, Jisheng Bai, Rohan Kumar Das
  10. Closing the Loop on Speech to Music Translation: Automatically Generating Synthetic Percussive Sequences on the Mridangam from Konnakol
    Authors: Gopika Krishnan, Julia Drabek, Akshay Anantapadmanabhan, Kaustuv Kanti Ganguli, Carlos Guedes
  11. TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification
    Authors: Nishit Anand, Ashish Seth, Ramani Duraiswami, Dinesh Manocha
  12. A Suite for Acoustic Language Model Evaluation
    Authors: Gallil Maimon, Amit Roth, Yossi Adi