
PhD Position (M/F): Bridging Text-to-Video Generation and Virtual Try-On for Realistic Human-Centric Scene Synthesis - INRIA
- Montbonnot-Saint-Martin - 38
- Fixed-term contract (CDD)
- INRIA
About the role
About Inria
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on many talents across more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects emerge and grow, with worldwide impact. Inria works with many companies and has supported the creation of more than 200 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society, and the economy.
Contract type: Fixed-term contract (CDD)
Required degree level: Master's degree (Bac+5) or equivalent
Role: PhD student
About the research centre or functional department
The Inria Centre at Université Grenoble Alpes brings together just under 600 people in 22 research teams and 7 research support services.
Its staff are spread across 3 campuses in Grenoble, working closely with research and higher-education laboratories and institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, etc.), as well as with the economic players of the region.
Active in the fields of computing and large distributed systems, safe software and embedded systems, environmental modelling at different scales, and data science and artificial intelligence, the Inria Centre at Université Grenoble Alpes takes part at the highest level in international scientific life, through both its results and its collaborations in Europe and the rest of the world.
Context and assets of the position
Title: Bridging Text-to-Video Generation and Virtual Try-On for Realistic Human-Centric Scene Synthesis
Supervision: Dr. Stéphane Lathuilière (INRIA-UGA)
Funding: BPI contract
Background and Motivation
Recent advancements in generative AI, and in particular diffusion models [1,2], have significantly enhanced the capabilities of text-to-video (T2V) models [3,4], allowing users to produce richly varied and imaginative scenes from natural language descriptions. These systems demonstrate strong scene diversity and flexibility, making them attractive for applications in entertainment, simulation, and human-computer interaction. However, a persistent limitation lies in their inability to enforce fine-grained conditioning. For example, while a T2V model can generate a person walking in a park, it cannot ensure that the person is wearing a specific garment or that the garment adapts convincingly to body shape, pose, and interaction with the environment. In contrast, virtual try-on (VTON) systems are highly specialized in clothing transfer tasks [5], excelling at fine-grained conditioning of garments on target individuals. They can adapt clothing to morphology, pose, and texture details with remarkable realism. Yet, they lack the scene diversity and broader contextual awareness that T2V models offer. Current VTON approaches generally operate in isolation, focusing on clothing alignment rather than situating the dressed person within dynamic, complex environments.
Bridging these two paradigms offers a powerful opportunity: to synthesize realistic humans dressed in controllable garments, embedded within richly described environments, and interacting with objects and other people. This integration could transform applications in e-commerce (immersive virtual try-on experiences), creative industries (fashion films, digital avatars), and simulation (training data for human-AI interaction).
Assignment
Research Objectives:
The main objective of this PhD project is to unify text-based scene control with the fine-grained detail preservation typical of virtual try-on systems. The project seeks to develop new architectures that combine the flexibility and diversity of text-to-video models with the precision of garment conditioning, ensuring that garments retain their texture, shape, and material properties while adapting seamlessly to a subject's morphology and pose. Beyond visual fidelity, the project will focus on enabling realistic interactions between humans and their environments, such as handling objects, sitting on furniture, or engaging with other individuals in a scene, while capturing the dynamic responses of garments to these interactions. A further objective is to maintain coherence between the described scene and the conditioned garments by ensuring consistency in lighting, shading, and motion. To achieve this, the research will explore multi-modal conditioning approaches that leverage text prompts, garment images, and body parameters simultaneously. Finally, the project will establish a comprehensive evaluation framework that combines quantitative metrics with perceptual user studies to measure garment fidelity, scene realism, and temporal consistency, thereby providing a robust benchmark for assessing progress in this domain.
Main activities
Methodology
The methodology of this research will follow a multi-stage approach. The initial stage will focus on establishing a baseline by conditioning state-of-the-art text-to-video models with garment images and segmentation masks. This will involve experimenting with latent-space fusion techniques to balance scene diversity with detail preservation, ensuring that garments remain faithful to their source while being embedded into dynamic, text-generated environments. The research will not be limited to garments: virtual try-on is one example of a task that requires fine-grained control.
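As a purely illustrative sketch of what this stage-1 baseline could look like, garment conditioning might be fused into the video latents along the channel dimension, as below. The module names, tensor shapes, and the 1x1 convolutional fusion are assumptions made for exposition, not the API of any particular T2V model.

    # Sketch of a latent-space fusion baseline (illustrative assumptions only).
    import torch
    import torch.nn as nn

    class GarmentLatentFusion(nn.Module):
        """Fuse a VAE-encoded garment image into noisy video latents."""
        def __init__(self, latent_channels: int = 4):
            super().__init__()
            # Learned 1x1 projection merging video and garment channels.
            self.fuse = nn.Conv3d(2 * latent_channels, latent_channels, kernel_size=1)

        def forward(self, video_latents, garment_latents, garment_mask):
            # video_latents:   (B, C, T, H, W) noisy latents from the T2V backbone
            # garment_latents: (B, C, H, W)    latents of the garment image
            # garment_mask:    (B, 1, H, W)    garment segmentation mask
            num_frames = video_latents.shape[2]
            g = (garment_latents * garment_mask).unsqueeze(2)
            g = g.expand(-1, -1, num_frames, -1, -1)  # broadcast over time
            return self.fuse(torch.cat([video_latents, g], dim=1))

Channel concatenation is only one fusion option; attending over garment tokens, explored in the third stage, is a natural alternative.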
The second stage will address garment or object dynamics and interaction modeling. At this point, the research will investigate how to incorporate physics-aware modules or learned priors for modeling deformation, allowing clothing or objects to realistically respond to body pose changes and environmental interactions (including both human-object and human-human interactions). Human pose estimation and object interaction cues will be integrated into the generative pipeline to enable scenarios where garments respond naturally to motion, contact, and occlusion.
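One hypothetical form such a learned prior could take is sketched below: a small network mapping body pose parameters and a contact cue to per-vertex offsets of a garment or object mesh. The SMPL-style 72-dimensional pose vector and the vertex count are assumptions made purely for illustration.

    # Sketch of a learned deformation prior (hypothetical design).
    import torch
    import torch.nn as nn

    class DeformationPrior(nn.Module):
        def __init__(self, pose_dim: int = 72, num_vertices: int = 4096):
            super().__init__()
            self.num_vertices = num_vertices
            self.net = nn.Sequential(
                nn.Linear(pose_dim + 1, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, num_vertices * 3),
            )

        def forward(self, pose, contact):
            # pose:    (B, 72) body pose parameters (SMPL-style, assumed)
            # contact: (B, 1)  scalar cue signaling contact with an object or person
            x = torch.cat([pose, contact], dim=-1)
            return self.net(x).view(-1, self.num_vertices, 3)  # per-vertex offsets

A physics-aware variant could instead supervise the same offsets with a differentiable cloth simulator.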
The third stage of the methodology will focus on cross-modal consistency mechanisms. This will involve designing cross-attention strategies and control networks to align text-based scene descriptions with garment or object conditioning inputs. Special attention will be given to developing consistency losses and temporal regularization strategies to ensure that garment details, lighting conditions, and scene elements remain coherent across frames in video generation. We also plan to include other modalities, such as audio, to improve realism.
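As one concrete instance of such a temporal regularization, the loss below penalizes frame-to-frame feature drift inside the garment region; the feature and mask shapes are assumptions for the sketch.

    # Sketch of a masked temporal-consistency loss (illustrative).
    import torch

    def temporal_consistency_loss(frame_features, garment_masks):
        # frame_features: (B, T, C, H, W) per-frame decoder features
        # garment_masks:  (B, T, 1, H, W) per-frame garment segmentation
        diff = frame_features[:, 1:] - frame_features[:, :-1]  # adjacent frames
        mask = garment_masks[:, 1:] * garment_masks[:, :-1]    # shared garment region
        num_channels = frame_features.shape[2]
        denom = (mask.sum() * num_channels).clamp(min=1.0)     # avoid divide-by-zero
        return ((diff * mask) ** 2).sum() / denom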
Finally, the project will focus on evaluation, with particular emphasis on cases that involve human-environment interaction; a dedicated dataset covering such cases will serve as a testbed for evaluating the proposed framework against existing T2V and VTON baselines. Evaluations will combine standard generative quality metrics, such as the Fréchet Inception Distance (FID) [6], with perceptual studies aimed at capturing user preferences and realism judgments.
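For reference, once Inception features have been extracted from real and generated frames, the FID reduces to a few lines following its standard closed form; feature extraction itself is assumed to happen elsewhere.

    # Fréchet Inception Distance between two feature sets.
    import numpy as np
    from scipy import linalg

    def fid(feats_real, feats_fake):
        # feats_*: (N, D) arrays of Inception activations
        mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        s1 = np.cov(feats_real, rowvar=False)
        s2 = np.cov(feats_fake, rowvar=False)
        covmean = linalg.sqrtm(s1 @ s2)
        if np.iscomplexobj(covmean):  # drop tiny imaginary parts from sqrtm
            covmean = covmean.real
        return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))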
References
[1] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 2020.
Skills
Technical skills and required level: We are seeking a motivated PhD candidate with a strong background in one or more of the following areas:
- speech processing, computer vision, machine learning
- solid programming skills
- interest in connecting AI with human cognition
Prior experience with LLMs, SpeechLMs, RL algorithms, or robotic platforms is a plus, but not mandatory.
Languages: English
Benefits
- Subsidized catering
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 days of RTT (full-time basis) + possibility of exceptional leave authorizations (e.g. sick children, moving house)
- Possibility of teleworking and flexible working-time arrangements
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports benefits (Association de gestion des oeuvres sociales d'Inria)
- Access to vocational training
- Social security coverage