INRIA Recruitment

Engineer - Text-To-Video Generation and Editing for Realistic Human-Centric Scene Synthesis (M/F) - INRIA

  • Montbonnot-Saint-Martin - 38
  • Fixed-term contract (CDD)
  • INRIA
Published on 17 December 2025

Missions of the position

About Inria

Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in tackling the challenges of digital technology, often at the interface with other disciplines. The institute draws on a wide range of talent across more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. In this way, the institute strives to meet the challenges of the digital transformation of science, society, and the economy.

Engineer: Text-to-Video Generation and Editing for Realistic Human-Centric Scene Synthesis

Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac+5) or equivalent

Position: Contract scientific engineer

About the centre or functional department

The Inria Centre at Université Grenoble Alpes brings together just under 600 people in 22 research teams and 7 research support services.

Its staff are spread across 3 campuses in Grenoble, working closely with laboratories and research and higher-education institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, ...), as well as with local economic players.

Active in the fields of computing and large-scale distributed systems, safe software and embedded systems, modelling of the environment at different scales, and data science and artificial intelligence, the Inria Centre at Université Grenoble Alpes contributes at the highest level to international scientific life through its results and its collaborations, both in Europe and in the rest of the world.

Context and assets of the position

Title: Bridging Text-to-Video Generation and Virtual Try-On for Realistic Human-Centric Scene Synthesis

Supervision: Dr Stéphane Lathuilière (INRIA-UGA)

Funding: BPI contract

Context: Background and Motivation
Recent advancements in generative AI, and in particular diffusion models [1,2], have significantly enhanced the capabilities of text-to-video (T2V) models [3,4], allowing users to produce richly varied and imaginative scenes from natural language descriptions. These systems demonstrate strong scene diversity and flexibility, making them attractive for applications in entertainment, simulation, and human-computer interaction. However, a persistent limitation lies in their inability to enforce fine-grained conditioning. For example, while a T2V model can generate a person walking in a park, it cannot ensure that the person is wearing a specific garment or that the garment adapts convincingly to body shape, pose, and interaction with the environment.

In contrast, virtual try-on (VTON) systems are highly specialized in clothing transfer tasks [5], excelling at fine-grained conditioning of garments on target individuals. They can adapt clothing to morphology, pose, and texture details with remarkable realism. Yet, they lack the scene diversity and broader contextual awareness that T2V models offer. Current VTON approaches generally operate in isolation, focusing on clothing alignment rather than situating the dressed person within dynamic, complex environments.

Bridging these two paradigms offers a powerful opportunity: to synthesize realistic humans dressed in controllable garments, embedded within richly described environments, and interacting with objects and other people. This integration could transform applications in e-commerce (immersive virtual try-on experiences), creative industries (fashion films, digital avatars), and simulation (training data for human-AI interaction).

Assigned mission

The successful candidate will play a critical dual role: providing essential engineering support and systems development for the research team while simultaneously contributing original research and development efforts toward the core project goals. This position requires strong software engineering skills combined with deep expertise in computer vision and generative AI.

Engineering Responsibilities (Development & Support)
System Development & Optimization: Design, implement, and maintain the infrastructure and core components for large-scale generative models (e.g., text-to-video, virtual try-on architectures). Focus on optimizing model training and inference pipelines for efficiency and scalability (e.g., using frameworks like PyTorch/TensorFlow, distributed training).
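
As a concrete illustration of the kind of pipeline work involved, below is a minimal sketch of multi-GPU training with PyTorch DistributedDataParallel, launched with `torchrun --nproc_per_node=N script.py`. The linear model, random data, and hyperparameters are placeholders and do not describe the project's actual codebase.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group("nccl")            # rank/world size read from torchrun env vars
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    # Placeholder model standing in for a T2V / VTON denoiser.
    model = DDP(torch.nn.Linear(512, 512).cuda(rank), device_ids=[rank])

    data = TensorDataset(torch.randn(1024, 512))            # dummy data
    loader = DataLoader(data, batch_size=32, sampler=DistributedSampler(data))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for (x,) in loader:
        x = x.cuda(rank)
        loss = model(x).pow(2).mean()                        # dummy objective
        opt.zero_grad()
        loss.backward()                                      # gradients averaged across processes by DDP
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```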

Data & Pipeline Management: Develop robust data ingestion and processing pipelines for multi-modal data (text prompts, garment images, 3D body parameters). Ensure the integrity and accessibility of research datasets.
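
For illustration only, a hypothetical PyTorch Dataset for such multi-modal samples (text prompt, garment image, body parameters) might look as follows; the file layout, field names, and dimensions are assumptions, not the project's actual data format.

```python
import json
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class TryOnSceneDataset(Dataset):
    """Hypothetical dataset: one JSONL record per sample with
    {"prompt": str, "garment": image path, "body_params": list of floats}."""
    def __init__(self, root: str):
        self.root = Path(root)
        self.records = [json.loads(line) for line in (self.root / "index.jsonl").open()]
        self.to_tensor = transforms.Compose([
            transforms.Resize((512, 384)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        rec = self.records[i]
        garment = self.to_tensor(Image.open(self.root / rec["garment"]).convert("RGB"))
        body = torch.tensor(rec["body_params"], dtype=torch.float32)
        return {"prompt": rec["prompt"], "garment": garment, "body_params": body}
```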

MLOps and Deployment: Facilitate the transition of research prototypes into stable, reusable codebase modules. Implement version control, automated testing, and documentation to support rapid iteration and collaboration within the research team.

Tooling and Infrastructure: Provide development support for research experiments, including setting up and managing cloud/GPU compute resources, developing visualization tools, and maintaining a high-quality, reproducible code environment.

Evaluation System Implementation: Engineer the software infrastructure for the comprehensive evaluation framework, automating quantitative metrics (e.g., FID, LPIPS, temporal coherence measures) and integrating tools for collecting and analyzing perceptual user study data.
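
As an example of what such automation could look like, the sketch below computes LPIPS with the publicly available `lpips` package and a simple frame-to-frame perceptual proxy for temporal coherence. It is illustrative only; FID would typically be taken from an existing implementation such as pytorch-fid or torchmetrics rather than reimplemented.

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")  # perceptual-distance network from the lpips package

def video_lpips_coherence(frames: torch.Tensor) -> float:
    """frames: (T, 3, H, W) scaled to [-1, 1]. Mean LPIPS between consecutive
    frames; lower values indicate a perceptually smoother video."""
    with torch.no_grad():
        d = loss_fn(frames[:-1], frames[1:])
    return d.mean().item()

# Example with a random 16-frame clip standing in for a generated video.
clip = torch.rand(16, 3, 256, 256) * 2 - 1
print("temporal LPIPS:", video_lpips_coherence(clip))
```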

Research & Development Objectives
The candidate will actively contribute to the following research objectives:

Architecture Development: Design and implement novel generative architectures that unify text-based scene control with fine-grained garment preservation. This involves combining the flexibility of text-to-video models with precision garment conditioning.

Fidelity and Coherence: Develop and integrate components (e.g., specialized attention mechanisms, diffusion model conditioning) to ensure that virtual garments retain their texture, shape, and material properties while maintaining coherence in lighting, shading, and motion across the generated scene.

Realistic Interaction Modeling: Research and implement techniques to model dynamic garment response to human-environment interactions (e.g., sitting, object handling, movement). Explore methods for integrating 3D body parameters and physical constraints into the generation process for enhanced realism.

Multi-Modal Conditioning: Lead the implementation of advanced multi-modal conditioning strategies, effectively leveraging text prompts, garment images, and body parameters simultaneously to steer the generation process.
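
To make the idea concrete, here is a minimal, hypothetical sketch of how text tokens, garment-image features, and body parameters could be projected to a shared width and concatenated into a single context sequence for cross-attention; all module names and dimensions are illustrative assumptions rather than the project's design.

```python
import torch
import torch.nn as nn

class MultiModalConditioner(nn.Module):
    """Projects each modality to a common width and concatenates them into one
    context sequence that a diffusion backbone could attend to via cross-attention."""
    def __init__(self, d_model=768, text_dim=768, garment_dim=1024, body_dim=82):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.garment_proj = nn.Linear(garment_dim, d_model)
        self.body_proj = nn.Linear(body_dim, d_model)

    def forward(self, text_tokens, garment_tokens, body_params):
        # text_tokens: (B, Lt, text_dim); garment_tokens: (B, Lg, garment_dim);
        # body_params: (B, body_dim), e.g. pose + shape coefficients.
        body = self.body_proj(body_params).unsqueeze(1)      # (B, 1, d_model)
        return torch.cat([self.text_proj(text_tokens),
                          self.garment_proj(garment_tokens),
                          body], dim=1)                      # (B, Lt + Lg + 1, d_model)

cond = MultiModalConditioner()
ctx = cond(torch.randn(2, 77, 768), torch.randn(2, 256, 1024), torch.randn(2, 82))
print(ctx.shape)  # torch.Size([2, 334, 768])
```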

Personal Research: The role includes dedicated time for the candidate to propose and lead self-directed research threads that align with or extend the project's core goals, resulting in publications and/or patentable technology.

Main activities

The Research Engineer will drive the technical development and implementation of advanced generative models, providing essential engineering support for a leading research project while pursuing independent research contributions.

Technical Scope & Implementation
The role involves the engineering execution of a multi-stage technical roadmap, requiring expertise in deep learning frameworks and system architecture design:

System Architecting: Design and optimize scalable architectures that unify text-to-video (T2V) models with fine-grained visual conditioning (e.g., virtual try-on, object control).

Conditioning Integration: Develop and implement multi-modal conditioning techniques, utilizing text, image, and body parameters simultaneously to control generation fidelity.

Dynamic Modeling: Engineer modules to incorporate physics-aware or learned priors to model realistic garment and object deformation in response to human pose changes and environmental interactions.

Coherence Mechanisms: Implement cross-attention and control networks, alongside advanced loss functions (e.g., temporal regularization), to ensure lighting, texture, and motion consistency across video sequences.
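
By way of example, a simple temporal-regularization term of the kind referred to above could penalize large frame-to-frame differences in the generated (or latent) video. This is a hedged sketch under that assumption; a production system would more likely compare optical-flow-warped frames.

```python
import torch

def temporal_regularization(video: torch.Tensor) -> torch.Tensor:
    """video: (B, T, C, H, W). Mean squared difference between consecutive frames,
    encouraging lighting, texture, and motion consistency over time."""
    return (video[:, 1:] - video[:, :-1]).pow(2).mean()

# Example with a random latent video standing in for denoiser output; the term
# would be added to the main training loss with a small weight.
latents = torch.randn(2, 16, 4, 32, 32)
print(temporal_regularization(latents).item())
```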

Evaluation Framework Development: Build and maintain an automated evaluation suite to benchmark the proposed framework using both standard quantitative metrics (e.g., FID) and tools for conducting perceptual user studies.

Infrastructure Support: Provide robust MLOps support, including pipeline development, optimization for distributed training, and maintaining a high-quality, reproducible research codebase.

Research Contribution
The candidate is expected to contribute original research ideas and technical solutions to key challenges, leading to publications and technical innovations within the domain of generative human-centric video.

Skills

Technical skills and required level: We are seeking a motivated candidate with a strong background in one or more of the following areas:

- speech processing, computer vision, machine learning
- solid programming skills
- interest in connecting AI with human cognition

Prior experience with LLMs, SpeechLMs, RL algorithms, or robotic platforms is a plus, but not mandatory.

Languages: English

Benefits

- Subsidised meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off (RTT, full-time basis) + possibility of exceptional leave (e.g. sick children, moving house)
- Possibility of teleworking and flexible organisation of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural, and sports activities (Association de gestion des œuvres sociales d'Inria)
- Access to vocational training
- Social security coverage

