Beyond Language: Multimodal Semantic Representations

MMSR I (2021)

Welcome to MMSR I, the first workshop on Multimodal Semantic Representations!

The demand for more sophisticated natural human-computer and human-robot interactions is rapidly increasing as users become more accustomed to conversation-like interactions with AI and NLP systems. Such interactions require not only the robust recognition and generation of expressions through multiple modalities (language, gesture, vision, action, etc.), but also the encoding of situated meaning.

When communications become multimodal, each modality in operation provides an orthogonal angle through which to probe the computational model of the other modalities, including the behaviors and communicative capabilities afforded by each. Multimodal interactions thus require a unified framework and control language through which systems interpret inputs and behaviors and generate informative outputs. This is vital for intelligent and often embodied systems to understand the situation and context that they inhabit, whether in the real world or in a mixed-reality environment shared with humans.

Goals

This workshop intends to bring together researchers who aim to capture elements of multimodal interaction such as language, gesture, gaze, and facial expression with formal semantic representations. We provide a space for both theoretical and practical discussion of how linguistic co-modalities support, inform, and align with “meaning” found in the linguistic signal alone. In so doing, the MMSR workshop has several goals:

  1. To provide an opportunity for computational semanticists to critically examine existing NLP semantic frameworks for their adequacy in expressing multimodal elements;
  2. To explore and identify challenges in the semantic representation of co-modalities cross-linguistically and cross-culturally;
  3. To gain understanding of domains and tasks where certain semantic frameworks (multimodal or not) are most effective and why.

Venue

MMSR I is being held online in conjunction with IWCS 2021! Please check out our Call for Papers below.

Submissions

We solicit papers on multimodal semantic representation, including but not limited to the topics listed in the Call for Papers.

Two types of submission are solicited: long papers (8 pages, excluding references and acknowledgments) and short papers (4 pages, excluding references and acknowledgments). Accepted papers will be published in the ACL Anthology. Authors will receive up to one extra page to address reviewer comments in the camera-ready version. Submissions should use the IWCS 2021 stylesheet and should be fully anonymized for double-blind reviewing.

Like IWCS 2021, MMSR does not have a pre-submission anonymity period, but we ask authors not to publicly advertise any preprints of submitted work during (or right before) the review period.

We strongly encourage students to submit to the workshop and will consider a student session depending on the number of submissions.

We will be using SoftConf to handle submissions, and we have extended the submission deadline to March 26! Please submit your papers via the SoftConf site.

Important Dates

Tentative schedule, subject to change:

   
March 26 (extended from March 19): Submissions due
April 16: Notification of acceptance decisions
May 7: Camera-ready papers due
June 16 (pending finalization): MMSR Workshop

Programme

MMSR I will be held in conjunction with IWCS 2021, on June 16 (16:00-20:00 Central European Time). The workshop will combine formal paper presentations with sessions discussing themes and key challenges in multimodal semantic representations.

The workshop program is now available!

Invited Talks

We are delighted to have Virginia Volterra of CNR-Institute of Cognitive Sciences and Technologies and Matthias Scheutz of the Tufts University HRI Lab as invited speakers!

From action to language through gesture

Virginia Volterra and Chiara Bonsignori, ISTC, CNR, Rome

This talk presents developmental evidence for continuity from action to gesture to sign and word. When children use a word, they reconstruct in some form the sensory and motor information they experienced with the referent. The basic formational components and the main depicting strategies observed in studies of adult gesture and sign are already present in the representational gestures of two-year-old hearing children acquiring spoken language across different cultural and linguistic groups. Representational strategies for depicting information about objects and events make visible different types of embodied practices and suggest a shared cognitive basis for signed and spoken languages. In this approach, language is seen as a form of action whose aim is always to produce meanings; to this end, diverse semiotic resources are mobilized, and this multimodal character applies equally to the study of spoken and signed language.

Attention, Incrementality, and Meaning: On the Interplay between Language and Vision in Reference Resolution

Matthias Scheutz, Tufts University

Humans are known to incorporate visual constraints in their incremental resolution of referential expressions. Language semantics here can guide visual attention and, conversely, visual processing provides candidate referents for further semantic evaluation. In this presentation, we will give an overview of our cognitive models of incremental language understanding and their interaction with visual processes, and describe the architectural and representational challenges posed by a deep integration of language and vision in computational architectures for embodied agents.
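
As a concrete illustration of this interplay, here is a minimal sketch in Python (not the speakers' actual architecture or code; the scene objects and attribute labels are hypothetical). It treats incremental reference resolution as pruning a set of visually detected candidates word by word, so that visual attention can narrow before the utterance is complete:

from dataclasses import dataclass

@dataclass
class Candidate:
    # A visually detected object described by simple attribute labels (hypothetical).
    name: str
    attributes: frozenset

def resolve_incrementally(words, candidates):
    # Prune the visual candidate set word by word, yielding the surviving
    # referents after each increment so attention can shift mid-utterance.
    remaining = list(candidates)
    for word in words:
        constrained = [c for c in remaining if word in c.attributes]
        if constrained:  # words with no visual counterpart (e.g. "the") impose no constraint
            remaining = constrained
        yield word, [c.name for c in remaining]

scene = [
    Candidate("mug-1", frozenset({"red", "mug"})),
    Candidate("mug-2", frozenset({"blue", "mug"})),
    Candidate("book-1", frozenset({"red", "book"})),
]
# "the red mug" narrows the candidate set as each word arrives.
for word, names in resolve_incrementally(["the", "red", "mug"], scene):
    print(f"after '{word}': {names}")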

Accepted Papers

EMISSOR: A platform for capturing multimodal interactions as Episodic Memories and Interpretations with Situated Scenario-based Ontological References

Selene Baez Santamaria, Thomas Baier, Taewoon Kim, Lea Krause, Jaap Kruijt and Piek Vossen

Teaching Arm and Head Gestures to a Humanoid Robot through Interactive Demonstration and Spoken Instruction

Michael Brady and Han Du

Requesting clarifications with speech and gestures

Jonathan Ginzburg and Andy Luecking

How vision affects language: comparing masked self-attention in uni-modal and multi-modal transformer

Nikolai Ilinykh and Simon Dobnik

Incremental Unit Networks for Multimodal, Fine-grained Information State Representation

Casey Kennington and David Schlangen

Annotating anaphoric phenomena in situated dialogue

Sharid Loáiciga, Simon Dobnik and David Schlangen

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

Letitia Parcalabescu, Albert Gatt, Anette Frank and Iacer Calixto

What is Multimodality?

Letitia Parcalabescu, Nils Trost and Anette Frank

Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference

Riko Suzuki, Hitomi Yanaka, Koji Mineshima and Daisuke Bekki

Are Gestures Worth a Thousand Words? An Analysis of Interviews in the Political Domain

Daniela Trotta and Sara Tonelli

Organization

Organizers

Program Committee

Contact email

mmsr.workshop.2021@gmail.com