Motivation
Problem: Current document retrieval methods force a trade-off between term-based (inverted index) and semantic (dense embedding) search.
Solution: Differentiable document retrieval could leverage deep learning to train a neural network that predicts relevant docids directly from the search query.
Insights:
An end-to-end differentiable document retrieval model can significantly outperform both inverted index and dense retrieval solutions.
In the absence of labeled training data, accurate training pairs can be created from the document collection itself using a query generator.
A decoder that is “prefix-aware weight-adaptive” encodes identifier tokens differently depending on the decoded prefix, which yields an improvement over vanilla decoders.
Development and Techniques
Core Structure
Input: A collection of documents, each assigned a specific docid (organized hierarchically, as described below).
Query generation: Take document IDs and document content as input, and produce synthetic queries for each document to supplement the labeled ground-truth queries.
Training pipeline: Accept a query (ground-truth or generated) as input, with the document ID as the expected output; train a transformer with cross-entropy loss (see the sketch below).
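A minimal sketch of one training step, assuming a Hugging Face-style T5 encoder-decoder. The (query, docid) pair shown is hypothetical, and encoding docids through the text tokenizer is a simplification; the paper's decoder uses a small dedicated vocabulary of cluster indices.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(queries, docid_strings):
    # Queries are the inputs; docid token sequences are the targets.
    inputs = tokenizer(queries, return_tensors="pt", padding=True)
    labels = tokenizer(docid_strings, return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    # The model computes token-level cross-entropy against the labels.
    loss = model(**inputs, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: teach the model to map a query to the docid "3 5 5".
training_step(["who wrote the declaration of independence"], ["3 5 5"])
```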
Semantic Identifiers for Documents
Goal: Documents with similar semantics should have close docids to facilitate the learning process.
Solution: Generate docids with a hierarchical k-means algorithm over BERT document embeddings; each docid is the sequence of cluster indices along the path from the root to the document's leaf.
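A minimal sketch of the hierarchical k-means assignment, assuming precomputed BERT embeddings (one vector per document); the branching factor k and leaf size here are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_docids(embeddings, k=10, leaf_size=100, prefix=()):
    """Return {doc_index: docid}, where a docid is the tuple of cluster
    indices along the path from the root down to the document."""
    n = len(embeddings)
    if n <= leaf_size:
        # Leaf level: disambiguate the remaining documents by position.
        return {i: prefix + (i,) for i in range(n)}
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    docids = {}
    for c in range(k):
        members = np.where(labels == c)[0]
        sub = assign_docids(embeddings[members], k, leaf_size, prefix + (c,))
        for local_idx, path in sub.items():
            docids[int(members[local_idx])] = path
    return docids
```

Because siblings in the tree are semantically close, similar documents end up sharing long docid prefixes, which is exactly what makes the identifiers easier for the decoder to learn.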
Prefix-Aware Weight-Adaptive Decoder
Unlike a standard decoding task, the same token appearing at different positions of the same identifier has different meanings, as the positions correspond to different clusters in the tree structure.
Example:
If identifier = 355, the 5 in the second position and the 5 in the third position have different meanings.
If identifier1 = 125 and identifier2 = 365, the 5 in the third position of each identifier is encoded differently, since the prefixes (12 vs. 36) lead to different clusters.
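The paper's PAWA decoder generates position- and prefix-specific classification weights with an additional decoder; the sketch below illustrates only the core idea, with hypothetical shapes: the output projection is produced from a summary of the decoded prefix, so the same token receives different weights under different prefixes.

```python
import torch
import torch.nn as nn

class PrefixAwareHead(nn.Module):
    """Classification head whose weights adapt to the decoded prefix."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # Generates a per-example projection matrix from the prefix state.
        self.weight_gen = nn.Linear(d_model, d_model * vocab_size)

    def forward(self, hidden, prefix_state):
        # hidden: (batch, d_model) decoder state at the current step
        # prefix_state: (batch, d_model) encoding of the prefix so far
        W = self.weight_gen(prefix_state)                # (batch, d*V)
        W = W.view(-1, self.vocab_size, self.d_model)    # (batch, V, d)
        # Adaptive classifier: logits depend on both state and prefix.
        return torch.bmm(W, hidden.unsqueeze(-1)).squeeze(-1)  # (batch, V)
```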
Inference
Beam search over the decoder output, constrained to valid identifiers, returns the top-N documents (see the sketch below).
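A sketch of constrained beam search at inference, assuming a Hugging Face-style generate(); docid_trie is a hypothetical prefix tree over all valid identifiers, so the beam can only ever decode real docids.

```python
def retrieve(model, tokenizer, query, docid_trie, n_docs=10):
    inputs = tokenizer(query, return_tensors="pt")

    def allowed_tokens(batch_id, prefix_ids):
        # At each step, only children of the decoded prefix are valid.
        return docid_trie.children(tuple(prefix_ids.tolist()))

    sequences = model.generate(
        **inputs,
        num_beams=n_docs,
        num_return_sequences=n_docs,
        prefix_allowed_tokens_fn=allowed_tokens,
    )
    # Each returned sequence is one candidate docid, best first.
    return [tokenizer.decode(s, skip_special_tokens=True) for s in sequences]
```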
Evaluation
The Neural Corpus Indexer (NCI) is evaluated on two benchmarks: NQ320K (Natural Questions) and TriviaQA. Below is its evaluation against other retrieval techniques:
It outperforms the previous state-of-the-art in Recall@1, Recall@10, and Recall@100, even with just the base model. This is promising, as it demonstrates the soundness of the model itself, not just the benefits of scale from more parameters or ensembling.
The paper also presents a detailed ablation study that argues convincingly that every aspect of the pipeline (query generation, regularization, semantic IDs, the decoder, etc.) is needed to improve accuracy. For those curious about what an ablation study should look like, this is a great example!
Limitations and Future Work
The model is not yet ready for scale.
The two benchmarks it was tested on are relatively small; to encode the entire web, the model would need to be much bigger than current retrieval models.
Beam search inference is too slow to serve the needs of web querying.
It is difficult to update the model-based index with new documents.
The paper suggests four avenues for future work:
Adopting a sparsely-gated Mixture-of-Experts (MoE) architecture [2] to enhance model capacity.
Grouping documents into semantic clusters and using NCI to retrieve relevant cluster identifiers, improving efficiency.
Applying model compression techniques, such as weight quantization [3] and knowledge distillation [4], to speed up inference.
Building a hybrid solution: a second index that serves new documents through traditional indexing algorithms.
References
[1] Paper link: https://arxiv.org/pdf/2206.02743.pdf
[2] https://machinelearningmastery.com/mixture-of-experts/
[3] https://medium.com/@joel_34050/quantization-in-deep-learning-478417eab72b
[4] https://neptune.ai/blog/knowledge-distillation