Beyond Language: Multimodal Semantic Representations
Welcome to MMSR I, the first workshop on Multimodal Semantic Representations!
The demand for more sophisticated natural human-computer and human-robot interactions is rapidly increasing as users become more accustomed to conversation-like interactions with AI and NLP systems. Such interactions require not only the robust recognition and generation of expressions through multiple modalities (language, gesture, vision, action, etc.), but also the encoding of situated meaning.
When communication becomes multimodal, each modality in operation provides an orthogonal angle through which to probe the computational models of the other modalities, including the behaviors and communicative capabilities afforded by each. Multimodal interactions thus require a unified framework and control language through which systems interpret inputs and behaviors and generate informative outputs. This is vital for intelligent and often embodied systems to understand the situation and context that they inhabit, whether in the real world or in a mixed-reality environment shared with humans.
This workshop intends to bring together researchers who aim to capture elements of multimodal interaction such as language, gesture, gaze, and facial expression with formal semantic representations. We provide a space for both theoretical and practical discussion of how linguistic co-modalities support, inform, and align with “meaning” found in the linguistic signal alone. In so doing, the MMSR workshop has several goals:
MMSR I is being held online in conjunction with IWCS 2021! Please check out our Call for Papers:
We solicit papers on multimodal semantic representation, including but not limited to the following topics:
Two types of submission are solicited: long papers (8 pages, excluding references and acknowledgments) and short papers (4 pages, excluding references and acknowledgments). Accepted papers will be published in the ACL Anthology. Authors will receive up to one extra page to address reviewer comments in the camera-ready version. Submissions should use the IWCS 2021 stylesheet and should be fully anonymized for double-blind reviewing.
Like IWCS 2021, MMSR does not have a pre-submission anonymity period, but we ask authors not to publicly advertise any preprints of submitted work during (or right before) the review period.
We strongly encourage students to submit to the workshop and will consider a student session depending on the number of submissions.
We will be using SoftConf to handle submissions. We have extended the submission deadline to March 26! Please submit your papers here!
Tentative schedule, subject to change:
| Date | Event |
|---|---|
| March 26 | Submissions due (deadline extended!) |
| April 16 | Notification of acceptance decisions |
| May 7 | Camera-ready papers due |
| June 16 (pending finalization) | MMSR Workshop |
MMSR I will be held in conjunction with IWCS 2021, on June 16 (16:00-20:00 Central European Time). The workshop will combine formal paper presentations with sessions discussing themes and key challenges in multimodal semantic representations.
The workshop program is now available!
We are delighted to have Virginia Volterra of CNR-Institute of Cognitive Sciences and Technologies and Matthias Scheutz of the Tufts University HRI Lab as invited speakers!
Virginia Volterra and Chiara Bonsignori, ISTC, CNR, Rome
This talk presents developmental evidence on continuity from action to gesture to sign and word. When children use a word, they re-construct in some form the sensory and motor information they experienced with the referent. The basic formational components and the main depicting strategies observed in studies of adult gesture and sign research are already present in the representational gestures of two-year-old hearing children acquiring spoken language across different cultural and linguistic groups. Representational strategies for depicting information about objects and events make visible different types of embodied practices and suggest a shared cognitive basis for signed and spoken languages. According to this approach, language is seen as a form of action whose aim is always to produce meaning, and to this end diverse semiotic resources are mobilized; this multimodal character applies equally to the study of spoken and signed language.
Matthias Scheutz, Tufts University
Humans are known to incorporate visual constraints in their incremental resolution of referential expressions. Language semantics here can guide visual attention and, conversely, visual processing provides candidate referents for further semantic evaluation. In this presentation, we will provide an overview of our cognitive models of incremental language understanding and interaction with visual processes, and also describe the architectural and representational challenges posed by a deep integration of language and vision in computational architectures of embodied agents.
EMISSOR: A platform for capturing multimodal interactions as Episodic Memories and Interpretations with Situated Scenario-based Ontological References
Selene Baez Santamaria, Thomas Baier, Taewoon Kim, Lea Krause, Jaap Kruijt and Piek Vossen
Teaching Arm and Head Gestures to a Humanoid Robot through Interactive Demonstration and Spoken Instruction
Michael Brady and Han Du
Requesting clarifications with speech and gestures
Jonathan Ginzburg and Andy Luecking
How vision affects language: comparing masked self-attention in uni-modal and multi-modal transformer
Nikolai Ilinykh and Simon Dobnik
Incremental Unit Networks for Multimodal, Fine-grained Information State Representation
Casey Kennington and David Schlangen
Annotating anaphoric phenomena in situated dialogue
Sharid Loáiciga, Simon Dobnik and David Schlangen
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu, Albert Gatt, Anette Frank and Iacer Calixto
What is Multimodality?
Letitia Parcalabescu, Nils Trost and Anette Frank
Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference
Riko Suzuki, Hitomi Yanaka, Koji Mineshima and Daisuke Bekki
Are Gestures Worth a Thousand Words? An Analysis of Interviews in the Political Domain
Daniela Trotta and Sara Tonelli