Episodic Transformer for Vision-and-Language Navigation

Alexander Pashevich; Cordelia Schmid; Chen Sun

doi:10.1109/ICCV48922.2021.01564

Conference paper, Year: 2021


Alexander Pashevich

Role: Author
PersonId : 1040983

Apprentissage de modèles à partir de données massives (Learning models from massive data)

Cordelia Schmid

Role: Author

Google LLC

Models of visual object recognition and scene understanding

Chen Sun

Role: Author

Google LLC

Brown University

Abstract

Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper addresses two of these challenges: handling long sequences of subtasks and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical for solving compositional tasks, and that pretraining and joint training with synthetic instructions further improve performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving task success rates of 38.4% and 8.5% on the seen and unseen test splits, respectively.

Domains

Artificial Intelligence [cs.AI]; Computer Vision and Pattern Recognition [cs.CV]

Main file

2105.06453v2.pdf (9.6 MB)

Origin: Files produced by the author(s)

Contributor: Virginie RICHARD

https://inria.hal.science/hal-03371803

Submitted on: Friday, July 12, 2024, 16:26:27

Last modified on: Tuesday, November 5, 2024, 16:00:03

Long-term archiving on: Monday, October 14, 2024, 09:33:59

Dates and versions

hal-03371803 , version 1 (12-07-2024)

Identifiers

HAL Id: hal-03371803, version 1
arXiv: 2105.06453
DOI: 10.1109/ICCV48922.2021.01564

Cite

Alexander Pashevich, Cordelia Schmid, Chen Sun. Episodic Transformer for Vision-and-Language Navigation. ICCV 2021 - International Conference on Computer Vision, Oct 2021, Virtual, United States. pp.1-18, ⟨10.1109/ICCV48922.2021.01564⟩. ⟨hal-03371803⟩

Collections

ENS-PARIS UGA CNRS INRIA INSMI LJK LJK_GI INRIA2 LJK-GI-THOTH PSL

91 views

31 downloads