PhD Student: Multimodal and Social Grounding of Spoken Language Models for the Study of Human Language Development (M/F) - INRIA
- Montbonnot-Saint-Martin (38)
- Fixed-term contract (CDD)
- INRIA
Missions of the position
About Inria
Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally shared with academic partners, involve more than 3,900 scientists in tackling the challenges of digital technology, often at the interface with other disciplines. The institute draws on a wide range of talents in more than forty different professions. Its 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. In this way, the institute strives to meet the challenges of the digital transformation of science, society and the economy.
PhD Student (F/M): Multimodal and social grounding of spoken language models for the study of human language development
Contract type: Fixed-term contract (CDD)
Required degree level: Master's degree or equivalent (Bac+5)
Role: PhD student
About the research centre or functional department
The Inria Centre at Université Grenoble Alpes brings together just under 450 people in 26 research teams and 9 research support services.
Its staff are spread across 3 campuses in Grenoble, working closely with laboratories and research and higher-education institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, ...), as well as with the region's economic players.
Active in the fields of computing and large distributed systems, safe software and embedded systems, environmental modelling at different scales, and data science and artificial intelligence, the Inria Centre at Université Grenoble Alpes contributes to international scientific life at the highest level through its results and its collaborations, both in Europe and in the rest of the world.
Context and assets of the position
Scientific Context and Motivation
Recent progress in self-supervised and generative speech modeling has led to the emergence of textless Speech Language Models (SpeechLMs), capable of learning directly from raw speech without textual supervision [Arora et al., 2025]. These models constitute a promising computational framework for studying language acquisition in a manner closer to how human infants learn language before literacy [Dupoux, 2018], while also offering an opportunity to draw inspiration from human learning mechanisms to design more adaptive and grounded conversational AI systems. Despite recent progress, current textless SpeechLMs still struggle to acquire higher-level linguistic competencies such as robust lexical representations, syntax, or semantics when trained on amounts of speech data comparable to those available to human infants during development, or when learning from ecologically realistic data [Lakhotia et al., 2021; Lavechin et al., 2024]. This discrepancy suggests that scale alone may not be sufficient for language acquisition. One possible explanation is that current models largely lack the multimodal and social grounding mechanisms that characterize early human learning [Dupoux, LeCun & Malik, 2026]. In natural development, language emerges through rich perceptual and communicative interactions combining speech, vision, action, attention, and adaptive social feedback. Understanding the role of such grounding mechanisms therefore constitutes an important challenge for both developmental science and the next generation of conversational AI systems.
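To make the textless setting concrete, the sketch below shows the discrete-unit pipeline popularized by GSLM [Lakhotia et al., 2021]: raw speech is discretized into pseudo-phone units (e.g., by clustering self-supervised HuBERT features, assumed here to happen upstream), and a causal language model is trained to predict the next unit. The `UnitLM` class and all hyperparameters are illustrative placeholders, not a prescribed architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming a GSLM-style pipeline [Lakhotia et al., 2021]:
# speech has already been discretized into pseudo-phone units upstream,
# and a causal Transformer is trained to predict the next unit.

class UnitLM(nn.Module):
    def __init__(self, n_units=100, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_units, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_units)

    def forward(self, units):  # units: (batch, time) integer codes
        t = units.size(1)
        # Causal mask so each position only attends to past units.
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.encoder(self.embed(units), mask=mask))

# Next-unit prediction loss on a toy batch of random unit sequences.
units = torch.randint(0, 100, (8, 50))
logits = UnitLM()(units[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 100),
                                   units[:, 1:].reshape(-1))
```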
Recent computational models of visually grounded speech learning have progressively explored how lexical structure may emerge from the joint statistics of speech and vision. Early work on cross-situational word learning showed that learners can associate spoken forms with visual referents by accumulating evidence across individually ambiguous situations [Smith & Yu, 2008]. Subsequent computational models extended this idea to continuous speech, showing that visual grounding can also support speech segmentation and the discovery of word-like units [Räsänen & Rasilo, 2015; Havard et al., 2019]. More recent visually grounded speech models based on self-supervised and contrastive learning further demonstrated that aligning raw speech with concurrent visual scenes can lead to the emergence of phonological and lexical representations without textual supervision [Harwath et al., 2018; Chrupała, 2022; Khorrami & Räsänen, 2025].
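As an illustration of the contrastive objective underlying several of these visually grounded speech models, the following minimal sketch implements a symmetric InfoNCE loss over paired speech and image embeddings. The encoders producing these embeddings are assumed to exist upstream, and the temperature value is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired speech/image embeddings:
    each utterance must identify its own visual scene among the other
    scenes in the batch, and vice versa."""
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = s @ v.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(len(s))     # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random vectors standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
```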
Beyond perceptual grounding, a growing line of computational research emphasizes the role of social interaction in language acquisition. In natural development, children are not merely exposed to multimodal sensory input, but actively engage in communicative exchanges where they produce linguistic behaviors, receive feedback, and progressively adapt their internal representations through interaction. Recent computational models have therefore started to investigate how learning may emerge from the interplay between perception, production, and social feedback. In particular, Nikolaus and Fourtassi [2021] proposed a neural model integrating both perception-based and production-based learning, showing that active language production and interaction-driven feedback improve semantic acquisition beyond passive perceptual learning alone. Their results highlight the importance of modeling language development not only as multimodal statistical learning from sensory input, but also as an interactive and socially guided process in which learners actively participate in the construction of their linguistic knowledge.
Nevertheless, current computational models of multimodal and social grounding still suffer from several important limitations. Most visually grounded models are trained on highly simplified image-captioning datasets that only weakly reflect the richness and ambiguity of real infant experience. Conversely, models attempting to incorporate social interaction often rely on text-based representations as an intermediate backbone, thereby ignoring many crucial communicative signals conveyed directly through speech itself, including the prosodic and interactive cues characteristic of child-directed speech (CDS), such as prosodic emphasis, repetition, corrective feedback, and interactive clarification strategies. More fundamentally, existing approaches rarely model language acquisition as a dynamic co-adaptation process between the child and the caregiver. Modeling how such multimodal and social interactions shape language learning therefore remains a major challenge, and constitutes a central motivation of the present PhD project.
References
- Arora, S., Chang, K. W., Chien, C. M., Peng, Y., Wu, H., Adi, Y., & Watanabe, S. (2025). On the Landscape of Spoken Language Models: A Comprehensive Survey. arXiv preprint arXiv:2504.08528.
- Chrupała, G. (2022). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Journal of Artificial Intelligence Research, 73, 673-707.
- Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language learner. Cognition, 173, 43-59.
- Dupoux, E., LeCun, Y., & Malik, J. (2026). Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science. arXiv preprint arXiv:2603.15381.
- Harwath, D., Recasens, A., Surís, D., Chuang, G., Torralba, A., & Glass, J. (2018). Jointly discovering visual objects and spoken words from raw sensory input. Proceedings of ECCV, 649-665.
- Havard, W. N., Chevrot, J.-P., & Besacier, L. (2019). Word recognition, competition, and activation in a model of visually grounded speech. Proceedings of CoNLL, 339-348.
- Khorrami, K., & Räsänen, O. (2025). A model of early word acquisition based on realistic-scale audiovisual naming events. Speech Communication, 167.
- Lakhotia, K., Kharitonov, E., Hsu, W.-N., Adi, Y., Polyak, A., Bolte, B., Nguyen, T.-A., Copet, J., Baevski, A., Mohamed, A., & Dupoux, E. (2021). On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9, 1336-1354.
- Lavechin, M., de Seyssel, M., Métais, M., Metze, F., Mohamed, A., Bredin, H., Dupoux, E., & Cristia, A. (2024). Modeling early phonetic acquisition from child-centered audio data. Cognition, 245, 105734.
- Nikolaus, M., & Fourtassi, A. (2021). Modeling the interaction between perception-based and production-based learning in children's early acquisition of semantic knowledge. Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL), 391-407.
- Räsänen, O., & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4), 792.
- Smith, L., & Yu, C. (2008). Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3), 1558-1568.
Assignment
Research Roadmap
Multimodal Grounding for Lexical Acquisition
The first objective will consist in developing a multimodal textless SpeechLM trained on realistic audiovisual environments going beyond standard caption-based datasets. In particular, the project will investigate controlled multimodal environments inspired by early infant experience, including realistic naming events, varying visual ambiguity, and different communicative speaking styles such as child-directed speech. The goal will be to study how multimodal perceptual information contributes to speech segmentation, lexical discovery, referential grounding, and the emergence of robust speech representations. Special attention will be paid to the interaction between acoustic variability, prosodic structure, and visual context during lexical acquisition.
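As a toy illustration of what such a controlled environment might look like, the hypothetical sketch below generates naming events whose referential ambiguity is tuned through the number of visual distractors. The vocabulary and the symbolic scene representation are placeholders for raw child-directed audio and real visual scenes.

```python
import random

# Hypothetical naming-event generator: each event pairs one spoken
# label with a scene containing the named referent plus a tunable
# number of distractor objects, so that referential ambiguity can be
# varied experimentally. All names below are illustrative only.
VOCAB = ["ball", "cup", "dog", "book", "shoe"]

def naming_event(n_distractors=2, rng=random):
    target = rng.choice(VOCAB)
    distractors = rng.sample([w for w in VOCAB if w != target], n_distractors)
    scene = distractors + [target]
    rng.shuffle(scene)
    # In the envisaged setup, `utterance` would be raw child-directed
    # audio and `scene` a recorded or rendered visual scene.
    return {"utterance": target, "scene": scene}

# More distractors -> more ambiguous word-referent mapping per event.
events = [naming_event(n_distractors=3) for _ in range(1000)]
```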
Interactive and Socially-Grounded Language Learning
The second objective will investigate the role of communicative interaction in language acquisition through an interactive child-caregiver learning framework. The PhD researcher will develop an agentic interaction setup in which a textless child SpeechLM progressively acquires lexical knowledge, while a pretrained caregiver model dynamically adapts its communicative behavior according to the learner's current state. Inspired by developmental studies of caregiver-infant interaction, the caregiver model will be able to emphasize target words, provide corrective or reinforcing feedback, adapt lexical complexity and use contrastive naming strategies, and simulate multimodal communicative cues such as gaze or pointing to guide attention. This framework will enable controlled studies of how interactive feedback and communicative adaptation influence lexical learning.
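The toy sketch below illustrates the general shape of such an interaction loop, with a cross-situational child learner and a caregiver that repeats a word and provides corrective or reinforcing feedback until the child identifies the referent. The classes and the feedback strategy are illustrative assumptions only; the actual framework will couple textless SpeechLMs rather than symbolic words.

```python
import random

class ChildLearner:
    """Toy learner accumulating word-referent co-occurrence evidence."""
    def __init__(self):
        self.counts = {}  # (word, referent) -> accumulated reward

    def guess(self, word, scene):
        # Pick the referent currently most associated with the word.
        return max(scene, key=lambda obj: self.counts.get((word, obj), 0))

    def update(self, word, referent, reward):
        key = (word, referent)
        self.counts[key] = self.counts.get(key, 0) + reward

class Caregiver:
    """Toy caregiver that repeats the word until the child succeeds."""
    def teach(self, child, word, scene, target, max_turns=3):
        for _ in range(max_turns):
            guess = child.guess(word, scene)
            child.update(word, guess, reward=1 if guess == target else -1)
            if guess == target:
                return True   # reinforcing feedback ends the exchange
        return False          # corrective feedback exhausted

child, caregiver = ChildLearner(), Caregiver()
for _ in range(20):  # repeated naming events with adaptive feedback
    scene = random.sample(["ball", "cup", "dog", "book"], 3)
    target = random.choice(scene)
    caregiver.teach(child, word=target, scene=scene, target=target)
```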
Toward Real-World Human-Robot Interaction (Exploratory Direction)
As a more exploratory perspective, the developed multimodal SpeechLMs will be embedded into the humanoid robotic platforms available at GIPSA-lab and INRIA in order to study grounded communication in situated interaction settings. This direction is notably inspired by work on multimodal communicative behaviors in embodied conversational agents [Deichler et al., 2023] as well as developmental robotic approaches investigating the emergence of language and gestures through social interaction and sensorimotor exploration [Cohen & Billard, 2018]. Such an extension could enable preliminary human-robot interaction experiments in which participants naturally interact with the agent and provide multimodal communicative feedback during situated language learning tasks.
Main activities
This PhD project aims to study how multimodal and social interactions contribute to human language acquisition through the development of spoken language models (SpeechLMs) grounded in the real world and trained directly on raw speech. The first objective is to study how realistic audiovisual environments can support speech segmentation, lexical discovery, and the emergence of robust speech representations. The second objective is to model language acquisition as an interactive child-caregiver process in which communicative feedback guides lexical learning. Finally, a more exploratory direction will consist in embedding these models into humanoid robotic platforms in order to study grounded communication in real interaction settings.
Skills
Applicants should hold a Master's degree (or equivalent) in one or several of the following fields: natural language and speech processing, computer vision, computational linguistics, computer science, data science, machine learning, or related areas.
Good programming skills in Python and experience with deep learning frameworks such as PyTorch are expected.
The candidate should also demonstrate a strong interest in interdisciplinary research at the intersection of artificial intelligence, speech technologies, and cognitive science; an interest in bridging AI and human cognition is highly desirable.
Strong communication and organizational skills are important, as the PhD student will be expected to work collaboratively within an interdisciplinary research environment and actively participate in scientific dissemination activities.
A good level of spoken and written English is required, including the ability to present research results clearly at conferences and to write scientific publications.
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off (RTT, full-time basis) + possibility of exceptional leave (e.g. sick children, moving house)
- Possibility of teleworking and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports benefits (Association de gestion des oeuvres sociales d'Inria)
- Access to vocational training
- Social security coverage and complementary health insurance - subject to conditions
Remuneration
2,300 euros gross per month