PhD Thesis: Multimodal and Social Grounding of Spoken Language Models for the Study of Human Language Development (M/F) - Doctorat.Gouv.Fr

  • Grenoble - 38
  • Fixed-term contract (CDD)
  • Doctorat.Gouv.Fr
Published on 21 May 2026

Job description

  • Institution: Université Grenoble Alpes
  • Doctoral school: EEATS - Electronique, Electrotechnique, Automatique, Traitement du Signal
  • Research laboratory: Grenoble Images Parole Signal Automatique
  • Thesis supervisor: Thomas HUEBER (ORCID 0000-0002-8296-5177)
  • Thesis start date: 2026-10-01
  • Application deadline: 2026-06-16T23:59:59

This PhD project aims to study how multimodal and social interactions contribute to human language acquisition, through the development of spoken language models (SpeechLMs) that are grounded in the real world and learned directly from raw speech. The first objective is to study how realistic audiovisual environments can support speech segmentation, lexical discovery, and the emergence of robust speech representations. The second objective is to model language acquisition as an interactive child-parent process in which communicative feedback guides lexical learning. Finally, a more exploratory direction will integrate these models into humanoid robotic platforms in order to study grounded communication in real interaction situations.

Recent progress in self-supervised and generative speech modeling has led to the emergence of textless Speech Language Models (SpeechLMs), capable of learning directly from raw speech without textual supervision [Arora et al., 2025]. These models constitute a promising computational framework for studying language acquisition in a manner closer to how human infants learn language before literacy [Dupoux, 2018], while also offering an opportunity to draw inspiration from human learning mechanisms to design more adaptive and grounded conversational AI systems.

Despite recent progress, current textless SpeechLMs still struggle to acquire higher-level linguistic competencies such as robust lexical representations, syntax, or semantics when trained on amounts of speech data comparable to those available to human infants during development, or when learning from ecologically realistic data [Lakhotia et al., 2021; Lavechin et al., 2024]. This discrepancy suggests that scale alone may not be sufficient for language acquisition. One possible explanation is that current models largely lack the multimodal and social grounding mechanisms that characterize early human learning [Dupoux, LeCun & Malik, 2026]. In natural development, language emerges through rich perceptual and communicative interactions combining speech, vision, action, attention, and adaptive social feedback. Understanding the role of such grounding mechanisms therefore constitutes an important challenge for both developmental science and the next generation of conversational AI systems.

Recent computational models of visually grounded speech learning have progressively explored how lexical structure may emerge from the joint statistics of speech and vision. Early work on cross-situational word learning showed that learners can associate spoken forms with visual referents by accumulating evidence across individually ambiguous situations [Smith & Yu, 2008]. Subsequent computational models extended this idea to continuous speech, showing that visual grounding can also support speech segmentation and the discovery of word-like units [Räsänen & Rasilo, 2015; Havard et al., 2019]. More recent visually grounded speech models based on self-supervised and contrastive learning further demonstrated that aligning raw speech with concurrent visual scenes can lead to the emergence of phonological and lexical representations without textual supervision [Harwath et al., 2018; Chrupala, 2022; Khorrami & Räsänen, 2025].
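As a minimal illustration of the cross-situational idea described above, the following sketch counts word-referent co-occurrences across individually ambiguous scenes and maps each word to its most strongly associated referent. The toy scenes, the symbolic word and object labels, and the function name `learn_mappings` are assumptions made for illustration only; the cited studies work with human learners or with continuous speech and images, not with symbolic tokens.

```python
from collections import defaultdict

def learn_mappings(scenes):
    """Map each word to the referent it co-occurs with most reliably.

    `scenes` is a list of (words, referents) pairs. Within a single
    scene, the word-referent pairing is ambiguous (hypothetical toy data).
    """
    cooc = defaultdict(lambda: defaultdict(int))  # cooc[word][referent]
    ref_count = defaultdict(int)                  # scenes containing referent
    for words, referents in scenes:
        for r in referents:
            ref_count[r] += 1
        for w in words:
            for r in referents:
                cooc[w][r] += 1
    mappings = {}
    for w, counts in cooc.items():
        # Normalise by referent frequency so that objects present in
        # many scenes do not win by default. Ties go to the referent
        # that comes first alphabetically.
        mappings[w] = max(sorted(counts), key=lambda r: counts[r] / ref_count[r])
    return mappings

# Each scene is ambiguous on its own; evidence accumulates across scenes.
scenes = [
    (["ball", "dog"], ["BALL", "DOG"]),
    (["ball", "cup"], ["BALL", "CUP"]),
    (["dog", "cup"], ["DOG", "CUP"]),
]
mappings = learn_mappings(scenes)
print(mappings)
```

No single scene disambiguates a word here, yet the accumulated counts do, which is the core intuition behind cross-situational learning.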

Beyond perceptual grounding, a growing line of computational research emphasizes the role of social interaction in language acquisition. In natural development, children are not merely exposed to multimodal sensory input, but actively engage in communicative exchanges where they produce linguistic behaviors, receive feedback, and progressively adapt their internal representations through interaction. Recent computational models have therefore started to investigate how learning may emerge from the interplay between perception, production, and social feedback. In particular, Nikolaus and Fourtassi [2021] proposed a neural model integrating both perception-based and production-based learning, showing that active language production and interaction-driven feedback improve semantic acquisition beyond passive perceptual learning alone. Their results highlight the importance of modeling language development not only as multimodal statistical learning from sensory input, but also as an interactive and socially guided process in which learners actively participate in the construction of their linguistic knowledge.

Nevertheless, current computational models of multimodal and social grounding still suffer from several important limitations. Most visually grounded models are trained on highly simplified image-captioning datasets that only weakly reflect the richness and ambiguity of real infant experience. Conversely, models attempting to incorporate social interaction often rely on text-based representations as an intermediate backbone, thereby ignoring many crucial communicative signals conveyed directly through speech itself, including the prosodic and interactive cues characteristic of child-directed speech (CDS), such as prosodic emphasis, repetition, corrective feedback, and interactive clarification strategies. More fundamentally, existing approaches rarely model language acquisition as a dynamic co-adaptation process between the child and the caregiver. Modeling how such multimodal and social interactions shape language learning therefore remains a major challenge, and constitutes a central motivation of the present PhD project.

Research Roadmap
Multimodal Grounding for Lexical Acquisition

The first objective is to develop a multimodal textless SpeechLM trained on realistic audiovisual environments that go beyond standard caption-based datasets. In particular, the project will investigate controlled multimodal environments inspired by early infant experience, including realistic naming events, varying visual ambiguity, and different communicative speaking styles such as child-directed speech. The goal is to study how multimodal perceptual information contributes to speech segmentation, lexical discovery, referential grounding, and the emergence of robust speech representations. Special attention will be paid to the interaction between acoustic variability, prosodic structure, and visual context during lexical acquisition.

Interactive and Socially-Grounded Language Learning

The second objective will investigate the role of communicative interaction in language acquisition through an interactive child-caregiver learning framework. The PhD researcher will develop an agentic interaction setup in which a textless child SpeechLM progressively acquires lexical knowledge, while a pretrained caregiver model dynamically adapts its communicative behavior according to the learner's current state. Inspired by developmental studies of caregiver-infant interaction, the caregiver model will be able to emphasize target words, provide corrective or reinforcing feedback, adapt lexical complexity and use contrastive naming strategies, and simulate multimodal communicative cues such as gaze or pointing to guide attention. This framework will enable controlled studies of how interactive feedback and communicative adaptation influence lexical learning.
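The interaction loop described above can be sketched in a heavily simplified, symbolic form. The sketch below is not the project's method: it assumes a toy lexicon of word-referent pairs instead of raw speech, lets the caregiver read the child's internal association strengths directly, and uses a hypothetical function `run_interaction` with an arbitrary learning rate `lr`. It only illustrates the general pattern of a caregiver adapting to the learner's state and providing corrective or reinforcing feedback.

```python
def run_interaction(vocab, n_turns=20, lr=0.5):
    """Toy child-caregiver loop over a symbolic lexicon (illustrative only).

    `vocab` maps each word to its true referent. The child keeps an
    association strength for every (word, referent) pair.
    """
    referents = sorted(set(vocab.values()))
    child = {w: {r: 0.0 for r in referents} for w in vocab}
    history = []
    for _ in range(n_turns):
        # Caregiver adaptation: name the word whose correct association
        # is currently weakest in the child (assumes direct access to
        # the learner's state, which is a simplification).
        word = min(sorted(vocab), key=lambda w: child[w][vocab[w]])
        # Child production: best current guess for the named word.
        guess = max(referents, key=lambda r: child[word][r])
        target = vocab[word]
        correct = guess == target
        history.append(correct)
        # Feedback: always strengthen the target association; after a
        # wrong guess, also weaken the guessed one (corrective feedback).
        child[word][target] += lr * (1.0 - child[word][target])
        if not correct:
            child[word][guess] -= lr
    return child, history

vocab = {"ball": "BALL", "dog": "DOG"}
child, history = run_interaction(vocab, n_turns=6)
print(history)
```

In this toy run the child initially mislabels one word, receives corrective feedback, and guesses correctly on later turns. A speech-based version would replace the symbolic tables with learned representations and the direct state access with observable learner behavior.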

Toward Real-World Human-Robot Interaction (Exploratory Direction)

As a more exploratory perspective, the developed multimodal SpeechLMs will be embedded into the humanoid robotic platforms available at GIPSA-lab and INRIA in order to study grounded communication in situated interaction settings. This direction is notably inspired by work on multimodal communicative behaviors in embodied conversational agents [Deichler et al., 2023] as well as developmental robotic approaches investigating the emergence of language and gestures through social interaction and sensorimotor exploration [Cohen & Billard, 2018]. Such an extension could enable preliminary human-robot interaction experiments in which participants naturally interact with the agent and provide multimodal communicative feedback during situated language learning tasks.

Candidate profile

Candidates must hold a Master's degree (or equivalent) in one or more of the following fields: natural language and speech processing, computer vision, computational linguistics, computer science, data science, machine learning, or related fields. Strong Python programming skills and experience with deep learning frameworks such as PyTorch are expected.

The candidate must also demonstrate a strong interest in interdisciplinary research at the intersection of artificial intelligence, speech technologies, and cognitive science (an interest in approaches that aim to bring AI and human cognition closer together will be particularly appreciated).

Good communication and organizational skills are important, as the doctoral student will work collaboratively in an interdisciplinary research environment and take an active part in scientific dissemination activities. A good level of written and spoken English is required, including the ability to present research results clearly at conferences and to write scientific publications.
