INRIA Recruitment

PhD Position F/M: Geometric-Semantic Adaptation of Multimodal LLMs for High-Level Landmark Detection in Complex Environments - INRIA

  • Villers-lès-Nancy - 54
  • Fixed-term contract (CDD)
  • INRIA
Published 9 April 2026

Job description

About Inria

Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on a wide range of talent in more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. In this way, the institute strives to meet the challenges of the digital transformation of science, society, and the economy.
Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac+5) or equivalent

Role: PhD student

Context and assets of the position

Not applicable.

Assigned mission

For a better understanding of the proposed research subject, a state of the art, bibliography, and scientific references are available at the following URL:

Main activities

Context:

Landmark detection, description, and matching form the cornerstone of autonomous visual localization systems deployed in unknown environments. While the most widely adopted and accurate solutions exploit low-level landmarks (i.e., points or lines), dealing with large-scale and visually ambiguous environments remains highly challenging due to the inherent multiplicity, ambiguity, and sensitivity of local primitives. With a view to localization systems with a broader scope of application, high-level landmarks such as objects present in the scene have proven to offer key advantages, like lower multiplicity, higher detection repeatability across viewpoints, and lower ambiguity compared to their local counterparts. However, existing solutions require the prior intervention of an expert to identify the object landmarks that can be used for localization in a given environment. Moreover, the object detectors used in these methods must be finetuned to recognize objects beyond common categories. The recent emergence of Multimodal Large Language Models (MLLMs) promises lower human intervention and easier deployment, but the consistency of their predictions under camera movements and their geometric accuracy have yet to be demonstrated. Moreover, the challenges posed by complex man-made environments such as museums or factories, which often feature intra-class variations of uncommon objects, remain to be addressed.

Objectives:

The research of this PhD will be centered on the concept of a useful landmark for localization in complex environments. Indeed, unlike cases where object detection or segmentation is an end in itself, using objects as landmarks for localization introduces specific constraints in terms of repeatability across different viewpoints, distinctiveness with respect to other landmarks, geometric accuracy, and adequate distribution within the environment.
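To make the notion of a "useful" landmark concrete, the criteria above could be aggregated into a simple score. The sketch below is purely illustrative: the `Candidate` fields, the `usefulness_score` helper, and its equal weighting are hypothetical choices, not part of the advertised project, and the spatial-distribution criterion is deliberately left out since it depends on the whole landmark set rather than a single candidate.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A candidate object landmark (all fields are illustrative)."""
    label: str              # semantic label, e.g. "bronze statue"
    detections: int         # viewpoints in which it was actually detected
    viewpoints: int         # viewpoints from which it was visible
    same_label_count: int   # scene instances sharing the same label
    reproj_error_px: float  # mean reprojection error of its 3D estimate

def usefulness_score(c: Candidate, max_error_px: float = 10.0) -> float:
    """Toy aggregation of three of the criteria named in the text:
    repeatability across viewpoints, distinctiveness with respect to
    other landmarks, and geometric accuracy. Weights are arbitrary;
    the distribution criterion is a set-level property and is omitted."""
    repeatability = c.detections / c.viewpoints
    distinctiveness = 1.0 / c.same_label_count
    accuracy = max(0.0, 1.0 - c.reproj_error_px / max_error_px)
    return (repeatability + distinctiveness + accuracy) / 3.0

# A unique, accurately triangulated statue scores well; one chair among
# twelve identical ones, poorly detected and poorly localized, does not.
statue = Candidate("statue", detections=9, viewpoints=10,
                   same_label_count=1, reproj_error_px=2.0)
chair = Candidate("chair", detections=5, viewpoints=10,
                  same_label_count=12, reproj_error_px=6.0)
```

Such a score would only rank candidates; selecting a final landmark set would additionally need the distribution constraint enforced jointly over the environment.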

To address these challenges, we propose to exploit the possibilities offered by MLLMs (e.g., BLIP-2, LLaVA, MiniGPT-4), which are able to follow instructions or answer questions about an image, to extract localization information from images. More precisely, we want to examine how their general-purpose detection and segmentation abilities can be redirected towards automatically identifying high-level localization landmarks in specialized environments. To this end, we first propose to assess both the geometric and the semantic sensitivity of different MLLMs to different combinations of visual and textual prompts, in order to derive automated prompting strategies. In particular, we want to study the integration of 3D geometric and fine-grained semantic information within the prompts, and assess the geometric accuracy of the corresponding models' answers. If necessary, we will then propose dedicated learning strategies for inducing the desired geometric capabilities within the model. In a second phase, we want to examine the potential complementarity between MLLMs and scene graphs built from images, to combine localization methods with adequate scene modeling.

Discussions with local LLM experts will be held throughout the project to help the PhD student deal with the specific characteristics of the language modality.

Skills

Profile:

- The candidate is completing a Master's or engineering degree in Computer Vision, Electrical Engineering, Computer Science, Applied Mathematics, or a related field.
- A strong background in image processing and/or computer vision is required.
- Strong programming skills in Python.
- Strong mathematical background.
- Familiarity with deep learning frameworks such as PyTorch.
- Commitment, teamwork, and a critical mind.
- Fluent verbal and written communication skills in English.

Benefits

- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage

Remuneration

From €2788 gross/month

