Date & Time: April 7, 2:00 pm IST – 5:30 pm IST
Venue: HYDERABAD INTERNATIONAL CONVENTION CENTRE (HICC)
Room: MR1.04
Welcome Address: 15 mins (2:00 – 2:15 pm)
Invited Talk by Oriol Nieto on Sound Design with GenAI – 25 mins (2:15 – 2:45 pm) [slides]
Invited Talk by Zhuo Chen on Challenges in Speech and Audio Generation – 25 mins (2:45 – 3:15 pm) [slides]
Invited Talk by Bhuvana Ramabhadran on Multilingual Speech Representations – 25 mins (3:15 – 3:45 pm)
Tea Break: 15 mins (3:45 – 4:00 pm)
Poster Session: 90 mins (4:00 – 5:30 pm)
Time: 2:15 – 2:45 pm
Speaker: Oriol Nieto (Adobe Research)
This presentation explores the forefront of generative AI research for sound design at Adobe Research. I will provide an overview of Latent Diffusion Models, which form the foundation of our work, and introduce several recent advancements focused on controllability and multimodality. I will begin with SILA [1], a technique designed to enhance the control of sound effects generated through text prompts. Following this, I will present Sketch2Sound [2], a model that generates sound effects conditioned on both audio recordings and text. Lastly, I will examine MultiFoley [3], a model capable of generating sound effects from both silent videos and text. Throughout the talk, I will showcase a series of examples and demos to illustrate the practical applications and potential of these models, making the case that we are only beginning to unveil a completely new paradigm in how to approach sound design.
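For readers unfamiliar with the underlying machinery, the sketch below is a toy, self-contained illustration of text-conditioned sampling from a latent diffusion model with classifier-free guidance. It is not the SILA, Sketch2Sound, or MultiFoley code; the network, dimensions, and noise schedule are placeholder assumptions chosen only to make the reverse-diffusion loop concrete.

```python
# Illustrative sketch only: a toy text-conditioned latent diffusion sampler with
# classifier-free guidance. All module names, sizes, and schedules are placeholders,
# not Adobe Research's actual implementations.
import torch
import torch.nn as nn

LATENT_DIM, TEXT_DIM, STEPS = 64, 32, 50

class ToyDenoiser(nn.Module):
    """Predicts the noise in a latent, given the noisy latent, a timestep, and a text embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + TEXT_DIM + 1, 128),
            nn.SiLU(),
            nn.Linear(128, LATENT_DIM),
        )

    def forward(self, z_t, t, text_emb):
        t_feat = torch.full((z_t.shape[0], 1), float(t) / STEPS)  # crude timestep feature
        return self.net(torch.cat([z_t, text_emb, t_feat], dim=-1))

@torch.no_grad()
def sample(denoiser, text_emb, guidance=3.0):
    """DDPM-style reverse process in latent space; the final latent would normally be
    decoded into audio by a separately trained decoder."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(1, LATENT_DIM)                # start from pure noise
    uncond = torch.zeros_like(text_emb)           # stand-in "empty prompt" embedding
    for t in reversed(range(STEPS)):
        eps_c = denoiser(z, t, text_emb)          # conditional noise estimate
        eps_u = denoiser(z, t, uncond)            # unconditional noise estimate
        eps = eps_u + guidance * (eps_c - eps_u)  # classifier-free guidance
        # Posterior mean for z_{t-1}; fresh noise is added below except at the last step.
        z = (z - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z

latent = sample(ToyDenoiser(), torch.randn(1, TEXT_DIM))
print(latent.shape)  # torch.Size([1, 64])
```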
Oriol is a Senior Audio Research Engineer at Adobe Research, where he focuses on human-centered AI for audio creativity, encompassing everything from music to audiobooks, video editing, and sound design. He holds a PhD in Music Technology from New York University, a Master's in Music, Science, and Technology from Stanford University, and a Master's in Information Technologies from Pompeu Fabra University. Highly involved with the ISMIR community, he was one of the three General Chairs for ISMIR 2024 in San Francisco this past November. Oriol has helped develop relevant open-source MIR packages such as MSAF, jams, mir-eval, and librosa; contributed to PyTorch; and, in his spare time, plays guitar, violin, and cajón, and sings (and screams).
Time: 2:45 – 3:15 pm
Speaker: Zhuo Chen (ByteDance)
Speech and audio generation has witnessed unprecedented advancement in the past two years, transforming how we approach human-machine communication. Various methods have yielded remarkable results, blurring the boundaries between synthetic and natural voice experiences. In this talk, I will first introduce our findings in audio generation through the seed-audio series, demonstrating how these innovations have overcome previous technical barriers. Next, I'll discuss the inherent limitations of traditional modular speech processing in dialogue systems and why they fall short of truly natural interaction. Building on these insights, I'll introduce our recent end-to-end voice interaction model, highlighting discoveries that emerged during development. The presentation will conclude with a discussion of current challenges in developing human-like conversation systems.
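As a rough illustration of the modular-versus-end-to-end contrast raised in the abstract (and not of ByteDance's actual systems), the sketch below places a conventional ASR → dialogue model → TTS cascade, where paralinguistic cues are discarded at the text bottleneck, next to a single model operating directly on speech tokens. All names, signatures, and return values are hypothetical placeholders.

```python
# Illustrative contrast only: modular cascade vs. end-to-end voice interaction.
# Every function and class here is a placeholder assumption for exposition.
from typing import List

Audio = List[float]  # stand-in for a waveform

# --- Modular cascade: ASR -> text dialogue model -> TTS -------------------------
def asr(speech: Audio) -> str:
    """Speech -> text; prosody, emotion, and interruption timing are discarded here."""
    return "transcribed user turn"

def dialogue_llm(user_text: str, history: List[str]) -> str:
    return "assistant reply text"

def tts(text: str) -> Audio:
    return [0.0] * 16000  # one second of silence as a stand-in waveform

def modular_turn(speech: Audio, history: List[str]) -> Audio:
    return tts(dialogue_llm(asr(speech), history))

# --- End-to-end: one model over speech tokens ------------------------------------
class E2EVoiceModel:
    """A single sequence model consuming audio tokens directly, so cues such as
    emotion, emphasis, and overlap can shape the generated reply."""
    def respond(self, speech_tokens: List[int], history_tokens: List[int]) -> List[int]:
        return [0] * 100  # generated audio tokens, to be decoded by a neural codec

print(len(modular_turn([0.0] * 16000, [])))
```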
Zhuo Chen is a Research Manager at ByteDance, where he leads a team specializing in voice interaction systems. He received his PhD from Columbia University in 2017 and subsequently served as a Principal Applied Scientist at Microsoft before joining ByteDance. Throughout his career, Dr. Chen has contributed significantly to the field of speech technology, with over 150 published research papers and patents. His work spans diverse speech processing domains, including speech recognition and translation, speech separation and enhancement, speaker identification and diarization, multi-channel processing and beamforming, speech self-supervised learning, and speech generation.
Time: 3:15 – 3:45 pm
Speaker: Bhuvana Ramabhadran (Google DeepMind)
Machine learning continues to produce models that can scale and solve multilingual speech and language understanding tasks. Self-supervised learning, first introduced in the field of computer vision, refers to frameworks that learn labels or targets from the unlabeled input signal. In other words, self-supervised learning makes use of proxy supervised learning tasks, such as contrastive learning, to identify specific parts of the signal that carry information, thereby helping models learn robust representations. Recently, self-supervised (pre-training) approaches have gained popularity and become key to the representations learnt by foundational models capable of addressing several tasks in many languages. Multilinguality and code-switching, common in multilingual societies, pose several challenges for speech and language processing. This talk addresses the following questions: Is there a joint latent and robust representation of multiple modalities that can help multilingual speech and language understanding? Are there unsupervised techniques to address languages with scarce data resources? Can this type of cross-lingual transfer aid zero-shot learning with these representations?
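As a concrete illustration of the "proxy task" idea mentioned in the abstract, the sketch below implements a generic InfoNCE-style contrastive objective over two augmented views of unlabeled signals. It is not the speaker's method; the toy encoder and the noise-based augmentation are assumptions made only for the example.

```python
# Minimal, generic contrastive self-supervised learning sketch (InfoNCE-style).
# The encoder and augmentations are placeholder assumptions, not the speaker's models.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Maps a raw 1-D signal frame to a unit-norm embedding; a stand-in for a speech encoder."""
    def __init__(self, in_dim=400, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(z1, z2, temperature=0.1):
    """Each item in view 1 must identify its counterpart in view 2 among the
    other items in the batch (the negatives): a supervised proxy task with no labels."""
    logits = z1 @ z2.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

encoder = TinyEncoder()
x = torch.randn(8, 400)                                # 8 unlabeled signal frames
view1 = x + 0.05 * torch.randn_like(x)                 # two "views" of the same data,
view2 = x + 0.05 * torch.randn_like(x)                 # e.g. different augmentations
loss = info_nce(encoder(view1), encoder(view2))
loss.backward()                                        # trains the encoder without labels
print(float(loss))
```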
Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google DeepMind, focusing on semi-supervised learning for speech recognition and multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager at IBM Research AI, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM’s worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on the use of speech synthesis to improve core speech recognition performance, and on self-supervised learning.