The rapid advancement of foundational large language models (LLMs) has transformed multiple domains, significantly boosting performance across a wide range of downstream tasks. In recent years, LLMs have also been increasingly applied to core speech and audio processing tasks such as automatic speech recognition (ASR) and audio captioning, and have enabled new tasks such as open-ended question answering. Despite growing interest in this area, however, the adoption of LLMs for speech and audio tasks has been comparatively slow due to several challenges: the limited availability of high-quality data, especially in non-English languages; the absence of comprehensive evaluation metrics; and the need for improved architectures and training methodologies that can effectively address the unique complexities of speech and audio processing.
The first Workshop on Speech and Audio Language Models (SALMA), co-located with ICASSP 2025, focuses on how LLMs can be leveraged to advance speech and audio processing. The workshop aims to bring together researchers specializing in speech, audio, and language models to foster in-depth discussion and identify synergies, with the goal of developing effective methodologies for applying LLMs to improve performance across tasks in the speech, audio, and music domains, including classification, generation, and retrieval. The workshop will also address fundamental open questions in this area.
For any questions, please email us at salmaicassp2025@gmail.com.