Beyond Language: Multimodal Semantic Representations


IWCS 2021


Important Dates




MMSR I will be held in conjunction with IWCS 2021, on June 16 (16:00-20:00 Central European Time). The workshop will combine formal paper presentations with sessions discussing themes and key challenges in multimodal semantic representations.

The workshop program is now available!

Invited Talks

We are delighted to have Virginia Volterra of CNR-Institute of Cognitive Sciences and Technologies and Matthias Scheutz of the Tufts University HRI Lab as invited speakers!

From action to language through gesture

Virginia Volterra and Chiara Bonsignori, ISTC, CNR, Rome

This talk presents developmental evidence on continuity from action to gesture to sign and word. When children use a word, they re-construct in some form the sensory and motor information they experienced with the referent. The basic formational components and the main depicting strategies observed in research on adult gesture and sign are already present in the representational gestures of two-year-old hearing children acquiring spoken language from different cultural and linguistic groups. Representational strategies for depicting information about objects and events make visible different types of embodied practices and suggest a shared cognitive basis for signed and spoken languages. According to this approach, language is seen as a form of action whose aim is always to produce meanings, and to this end diverse semiotic resources are mobilized. Its multimodal character is equally applicable to the study of spoken and signed language.

Attention, Incrementality, and Meaning: On the Interplay between Language and Vision in Reference Resolution

Matthias Scheutz, Tufts University

Humans are known to incorporate visual constraints in their incremental resolution of referential expressions. Here, language semantics can guide visual attention and, conversely, visual processing can provide candidate referents for further semantic evaluation. In this presentation, we will provide an overview of our cognitive models of incremental language understanding and interaction with visual processes, and also describe the architectural and representational challenges posed by a deep integration of language and vision in computational architectures of embodied agents.

Accepted Papers

EMISSOR: A platform for capturing multimodal interactions as Episodic Memories and Interpretations with Situated Scenario-based Ontological References

Selene Baez Santamaria, Thomas Baier, Taewoon Kim, Lea Krause, Jaap Kruijt and Piek Vossen

Teaching Arm and Head Gestures to a Humanoid Robot through Interactive Demonstration and Spoken Instruction

Michael Brady and Han Du

Requesting clarifications with speech and gestures

Jonathan Ginzburg and Andy Luecking

How vision affects language: comparing masked self-attention in uni-modal and multi-modal transformer

Nikolai Ilinykh and Simon Dobnik

Incremental Unit Networks for Multimodal, Fine-grained Information State Representation

Casey Kennington and David Schlangen

Annotating anaphoric phenomena in situated dialogue

Sharid Loáiciga, Simon Dobnik and David Schlangen

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

Letitia Parcalabescu, Albert Gatt, Anette Frank and Iacer Calixto

What is Multimodality?

Letitia Parcalabescu, Nils Trost and Anette Frank

Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference

Riko Suzuki, Hitomi Yanaka, Koji Mineshima and Daisuke Bekki

Are Gestures Worth a Thousand Words? An Analysis of Interviews in the Political Domain

Daniela Trotta and Sara Tonelli