News
- 2025 May: Call for Papers and Submission Guidelines are now available.
- 2025 March: Website Launched: The MA-LLM25 workshop website is now live.
- 2025 March: Workshop Accepted! Our workshop has been officially accepted at ACM MM 2025.
Gaining insight into large multimedia collections is crucial in many domains. The emergence of Multimodal Large Language Models (MLLMs) has brought an unprecedented boost in the accuracy and applicability of multimedia analysis. The primary way of interacting with these models, however, is still text-based prompting or conversational agents. Text-based interaction adds an intermediate layer that obfuscates the underlying data, making it a cumbersome way to gain insight from multimedia data.
Multimedia analytics, on the other hand, combines techniques from multimedia analysis, visualization, and data mining to extract insights from large-scale multimedia collections. The synergistic interaction between expert and machine is crucial in this process, as it allows for expert-driven extraction of rich and diverse insights. Visualizations can enable such interaction by presenting scalable views of datasets, ranging from high-level summaries of collections to individual data points, compact summaries of results, and possible navigation directions for exploration. Interactive visualizations combined with multimodal conversational agents have the potential to significantly widen the communication channel between humans and MLLMs, yielding far more effective ways of gaining insight from the data.
Realizing this potential raises a number of questions for Multimedia Analytics, such as: