SIGIR 2024 Tutorial:
Recent Advances in Generative Information Retrieval


Yubao Tang¹,	Ruqing Zhang¹,	Zhaochun Ren²,	Jiafeng Guo¹,	Maarten de Rijke³

¹CAS Key Lab of Network Data Science and Technology, ICT, CAS, University of Chinese Academy of Sciences, ²Leiden University, ³University of Amsterdam

March 28, 2024

About this tutorial

Generative retrieval (GR) has recently become a highly active area of information retrieval (IR). In contrast to the traditional "index-retrieve-then-rank" pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to directly map a query to its relevant document identifiers (i.e., docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications.
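
As a rough formalization of the query-to-docid mapping described above, the following sketch writes the model as an autoregressive generator over docid tokens and trains it by maximum likelihood. The notation is an assumption introduced here for illustration and does not come from the tutorial page: $q$ is a query, $d = (d_1, \dots, d_T)$ is a tokenized docid, $\theta$ are the model parameters, and $\mathcal{D}$ is a set of training pairs of queries and relevant docids.

```latex
P(d \mid q; \theta) = \prod_{t=1}^{T} P(d_t \mid d_{<t}, q; \theta),
\qquad
\mathcal{L}(\theta) = -\sum_{(q, d) \in \mathcal{D}} \log P(d \mid q; \theta)
```

Individual GR methods typically extend this basic objective, for example with additional indexing or ranking losses, so it should be read as a common starting point rather than as the formulation of any particular paper.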

We start with preliminaries covering the foundations and problem formulations of GR. We then turn to recent progress in docid design, training approaches, inference strategies, and applications of GR. We end by outlining remaining challenges and issuing a call for future GR research. The tutorial is intended for both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.

Schedule

Our tutorial is scheduled for July 14-18, 2024. The slides are [here].

Time	Section	Presenter
09:00 - 09:10	Section 1: Introduction	Maarten de Rijke
09:10 - 09:30	Section 2: Definition & Preliminaries	Zhaochun Ren
09:30 - 10:10	Section 3: Docid designs	Yubao Tang
10:10 - 10:25	15-minute coffee break
10:25 - 11:00	Section 4: Training approaches	Zhaochun Ren
11:00 - 11:20	Section 5: Inference strategies	Yubao Tang
11:20 - 11:30	Section 6: Applications	Yubao Tang
11:30 - 11:50	Section 7: Challenges & Opportunities	Maarten de Rijke
11:50 - 12:00	Q & A	All

Reading List

The tutorial extensively covers papers highlighted in bold.

Section 3: Docid designs

3.1 Pre-defined docids

3.1.1 A single docid represents a document

3.1.1.1 Number-based docids

Unstructured atomic integers

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
DynamicRetriever: A Pre-trained Model-based IR System Without an Explicit Index (Zhou et al. 2023)
Generative Retrieval as Dense Retrieval (Nguyen and Yates 2023c)
Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
CodeDSI: Differentiable Code Search (Nadeem et al. 2022)
DSI++: Updating Transformer Memory with New Documents (Mehta et al. 2022)

Naively structured strings

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
Bridging the Gap between Indexing and Retrieval for Differentiable Search Index with Query Generation (Zhuang et al. 2023)
CodeDSI: Differentiable Code Search (Nadeem et al. 2022)

Semantically structured strings

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
A Neural Corpus Indexer for Document Retrieval (Wang et al. 2022)
Understanding Differential Search Index for Text Retrieval (Chen et al. 2023c)
CodeDSI: Differentiable Code Search (Nadeem et al. 2022)

Product quantization strings

Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
Continual Learning for Generative Retrieval over Dynamic Corpora (Chen et al. 2023a)
Recommender Systems with Generative Retrieval (Rajput et al. 2023)

3.1.1.2 Word-based docids

Titles

Autoregressive Entity Retrieval (De Cao et al. 2021)
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)
GERE: Generative evidence retrieval for fact verification (Chen et al. 2022a)
Data-efficient Autoregressive Document Retrieval for Fact Verification (Thorne et al. 2022)
Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
Generative Multi-hop Retrieval (Lee et al. 2022)
Nonparametric Decoding for Generative Retrieval (Lee et al. 2023)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)

URLs

Ultron: An ultimate retriever on corpus with a model-based indexer (Zhou et al. 2022)
TOME: A Two-stage Approach for Model-based Retrieval (Ren et al. 2023)
Data-efficient Autoregressive Document Retrieval for Fact Verification (Thorne et al. 2022)

Pseudo queries

Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies (Tang et al. 2023a)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)

Important terms

Term-Sets Can Be Strong Document Identifiers For Auto-Regressive Search Engines (Zhang et al. 2023)

3.1.2 Multiple docids represent a document

3.1.2.1 Single type

Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)
A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning (Chen et al. 2023b)

3.1.2.2 Diverse types

Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)

3.2 Learnable docids

3.2.1 Repeatable learnable docids

Learning to Tokenize for Generative Retrieval (Sun et al. 2023)
Auto Search Indexer for End-to-End Document Retrieval (Yang et al. 2023)

3.2.2 Unique learnable docids

NOVO: Learnable and Interpretable Document Identifiers for Model-Based IR (Wang et al. 2023)

Section 4: Training approaches

4.1 Stationary scenarios

4.1.1 Supervised learning

Transformer Memory as a Differentiable Search Index (Tay et al. 2022)
Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies (Tang et al. 2023a)
Bridging the Gap between Indexing and Retrieval for Differentiable Search Index with Query Generation (Zhuang et al. 2023)
A Neural Corpus Indexer for Document Retrieval (Wang et al. 2022)
How Does Generative Retrieval Scale to Millions of Passages? (Pradeep et al. 2023)

4.1.2 Pre-training

CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)

4.1.3 Pairwise optimization

Learning to Rank in Generative Retrieval (Li et al. 2024)

4.1.4 Listwise optimization

Listwise Generative Retrieval Models via a Sequential Learning Process (Tang et al. 2023b)

4.1.5 Multiple optimization

Enhancing Generative Retrieval with Reinforcement Learning from Relevance Feedback (Zhou et al. 2023)

4.2 Dynamic scenarios

IncDSI: Incrementally Updatable Document Retrieval (Kishore et al. 2023)
DSI++: Updating Transformer Memory with New Documents (Mehta et al. 2022)
Continual Learning for Generative Retrieval over Dynamic Corpora (Chen et al. 2023a)

4.3 GR & QA

Generative retrieval for conversational question answering (Li et al. 2023)
Re3val: Reinforced and Reranked Generative Retrieval (Song et al. 2024)
UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models (Li et al. 2023)

4.4 Large-scale corpora

Scalable and Effective Generative Information Retrieval (Zeng et al. 2023)

Section 5: Inference strategies

5.1 A single docid represents a document

Constrained beam search with prefix tree

Autoregressive Entity Retrieval (De Cao et al. 2021)
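
To make the idea behind this heading concrete, here is a minimal sketch of beam search constrained by a prefix tree, so that only token sequences corresponding to existing docids can be generated. It is an illustration under stated assumptions, not the implementation from the cited paper: the names `build_trie`, `allowed_next`, `constrained_beam_search`, and the `<eos>` marker are hypothetical, and `toy_log_prob` is a hand-written stand-in for a trained sequence-to-sequence model whose scores are not a normalized distribution.

```python
import math
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-docid marker

# Signature of the scoring function: (generated prefix, candidate token) -> log-score.
LogProb = Callable[[Tuple[str, ...], str], float]


def build_trie(docids: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict prefix tree over tokenized docids."""
    root: dict = {}
    for tokens in docids:
        node = root
        for tok in list(tokens) + [EOS]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: dict, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that keep `prefix` on a path to some valid docid."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return list(node.keys())


def constrained_beam_search(
    trie: dict, log_prob: LogProb, beam_size: int = 2, max_len: int = 16
) -> List[Tuple[Tuple[str, ...], float]]:
    """Beam search in which every expansion must follow an edge of the trie."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in allowed_next(trie, prefix):
                new_score = score + log_prob(prefix, tok)
                if tok == EOS:
                    # The prefix is a complete, valid docid.
                    finished.append((prefix, new_score))
                else:
                    candidates.append((prefix + (tok,), new_score))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished[:beam_size]


def toy_log_prob(prefix: Tuple[str, ...], token: str) -> float:
    """Stand-in for a trained model: fixed, unnormalized token preferences."""
    preferred = {"carbon": 0.6, "cycle": 0.7, "tax": 0.2, EOS: 0.9}
    return math.log(preferred.get(token, 0.1))


# Toy corpus of three word-based docids (for example, short titles).
docids = [("carbon", "cycle"), ("carbon", "tax"), ("water", "cycle")]
trie = build_trie(docids)
results = constrained_beam_search(trie, toy_log_prob, beam_size=2)
for tokens, score in results:
    print(" ".join(tokens), round(score, 3))
```

Because each expansion is looked up in the trie, every returned sequence is guaranteed to be one of the docids in the corpus, regardless of what the scoring function prefers.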

Constrained greedy search with inverted index

Term-Sets Can Be Strong Document Identifiers For Auto-Regressive Search Engines (Zhang et al. 2023)

5.2 Multiple docids represent a document

Constrained beam search with FM-index

Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)

Aggregation functions

Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)
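
When a document is represented by several docids, the scores of the generated identifiers have to be combined into one score per document. The sketch below shows the simplest such aggregation, a plain sum over the identifiers that point to each document. It is an assumption made for illustration: the cited papers define their own scoring functions, which are not reproduced here, and the names `aggregate_scores`, `predicted`, and `id_to_docs` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def aggregate_scores(
    predicted: Iterable[Tuple[str, float]],
    id_to_docs: Dict[str, Sequence[str]],
) -> List[Tuple[str, float]]:
    """Sum the scores of generated identifiers per document and rank documents.

    predicted:  (identifier, score) pairs produced by the generative model.
    id_to_docs: mapping from each identifier to the documents it occurs in.
    """
    doc_scores: Dict[str, float] = defaultdict(float)
    for identifier, score in predicted:
        for doc in id_to_docs.get(identifier, ()):
            doc_scores[doc] += score
    # Sort by descending score; break ties by document name for determinism.
    return sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))


# Toy example: three generated identifiers, each linked to one or two documents.
predicted = [("carbon cycle", 2.0), ("photosynthesis", 1.5), ("climate", 1.0)]
id_to_docs = {
    "carbon cycle": ["d1", "d2"],
    "photosynthesis": ["d1"],
    "climate": ["d2", "d3"],
}
ranking = aggregate_scores(predicted, id_to_docs)
print(ranking)
```

A sum rewards documents that are supported by many generated identifiers; other choices, such as taking the maximum or weighting identifiers by type, trade this off differently.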

Section 6: Applications

6.1 Knowledge-intensive language tasks (KILT)

Autoregressive Entity Retrieval (De Cao et al. 2021)
GERE: Generative evidence retrieval for fact verification (Chen et al. 2022a)
Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)
A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning (Chen et al. 2023b)
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)
Data-efficient Autoregressive Document Retrieval for Fact Verification (Thorne et al. 2022)

6.2 Multi-hop retrieval

Generative Multi-hop Retrieval (Lee et al. 2022)

6.3 Recommendation

Recommender Systems with Generative Retrieval (Rajput et al. 2023)
Generative Retrieval with Semantic Tree-Structured Item Identifiers via Contrastive Learning (Si et al. 2023)

6.4 Code retrieval

CodeDSI: Differentiable Code Search (Nadeem et al. 2022)

Available code

Autoregressive Entity Retrieval (De Cao et al. 2021)
CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks (Chen et al. 2022b)
GERE: Generative evidence retrieval for fact verification (Chen et al. 2022a)
Autoregressive Search Engines: Generating Substrings as Document Identifiers (Bevilacqua et al. 2022)
Multiview Identifiers Enhanced Generative Retrieval (Li et al. 2023)
Continual Learning for Generative Retrieval over Dynamic Corpora (Chen et al. 2023a)
Nonparametric Decoding for Generative Retrieval (Lee et al. 2023)
Bridging the Gap between Indexing and Retrieval for Differentiable Search Index with Query Generation (Zhuang et al. 2023)
A Neural Corpus Indexer for Document Retrieval (Wang et al. 2022)
A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning (Chen et al. 2023b)
Understanding Differential Search Index for Text Retrieval (Chen et al. 2023c)
Generative Multi-hop Retrieval (Lee et al. 2022)
Learning to Rank in Generative Retrieval (Li et al. 2024)
IncDSI: Incrementally Updatable Document Retrieval (Kishore et al. 2023)
Generative retrieval for conversational question answering (Li et al. 2023)
Scalable and Effective Generative Information Retrieval (Zeng et al. 2023)