Talking to models requires special prompts that help them think sequentially
How and Why Chain-of-Thought works so well
Today, we’re taking a step back from building, fine-tuning, and training models, and a step forward into prompting them. As these models become ubiquitous, research into how best to use them becomes increasingly essential. You’ve probably heard some noise around prompt engineering; Riley Goodside on Twitter is a great example of the power of prompting.
The most well-known research work in this domain comes to us from Google: “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. This is the paper we will be discussing today.
Introduction and Motivation
Insight: A lot of problems, like arithmetic, benefit from a “chain of thought”: solving intermediate steps in order to reach the final answer. Prompting provides a great avenue for in-context few-shot learning, which makes it a natural way to encourage models to produce a chain of thought.
An example below:
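To make the contrast concrete, here is a minimal sketch of the difference between standard few-shot prompting and chain-of-thought prompting. The tennis-ball exemplar is from the paper; the `build_prompt` helper and its exact wording are illustrative assumptions.

```python
# One worked question, used as the few-shot exemplar in both styles.
QUESTION = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Standard prompting: the exemplar maps the question directly to the answer.
standard_prompt = QUESTION + "\nA: The answer is 11.\n"

# Chain-of-thought prompting: the exemplar spells out intermediate steps
# before stating the final answer.
cot_prompt = (
    QUESTION
    + "\nA: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_prompt(exemplar: str, new_question: str) -> str:
    """Prepend a worked exemplar to a new question, few-shot style."""
    return exemplar + "\n" + new_question + "\nA:"

if __name__ == "__main__":
    print(build_prompt(cot_prompt, "Q: A cafeteria had 23 apples..."))
```

The only difference between the two prompts is the exemplar’s answer text; the model then imitates that format, emitting intermediate steps on the new question before its final answer.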
Why Chain of Thought
There are many other approaches the authors could have taken. Other papers have integrated the idea of a scratchpad, the idea of offloading to other tools, as well as the idea of a special work token that allows the model to “think”.
The authors claim that chain of thought is effective for the following reasons:
Multi-step problems can be naturally decomposed and additional computation can be allocated to problems with more reasoning steps.
It allows, in principle, the machine to solve any task a human can solve via language.
It’s interpretable.
They also find that chain-of-thought is an emergent ability of large language models — it works better as models scale up. Given the ever-increasing size of these models, the expectation is that chain-of-thought will perform even better on future models!
Evaluation
Two core observations: 1) chain-of-thought prompting is an emergent ability of model scale, and 2) it yields larger performance gains on more complicated problems. It also compares favorably to the prior state of the art (SOTA), which came from fine-tuned models, while the chain-of-thought models require no fine-tuning at all. With that in mind, these results are quite impressive.
Chain-of-thought reasoning also improved the models’ “common sense” performance on relevant benchmarks. This supports the idea that chain-of-thought helps the models delegate computational load to intermediate steps.
For many reasoning tasks where the scaling curve looked flat with increasing model size, chain-of-thought prompting led to dramatically steeper scaling curves. The core idea: standard prompting is just a lower bound on LLM capabilities, which can be greatly improved with better prompts.
Limitations and Future Work
The paper makes a pretty convincing argument for the effectiveness of chain-of-thought. Future research can look to quantify this: how much more can we expect reasoning ability to improve with further increases in model scale? How do we rate prompts on the quality of the chains of thought they produce? What other prompting methods might improve model performance further?
Although the cost of manually augmenting exemplars with chains of thought is minimal in a few-shot setting, such annotation could be prohibitive for fine-tuning a model on chains of thought. Auto-generating annotations is exciting future work.
There is no guarantee of correct reasoning paths: a chain of thought can lead to both correct and incorrect answers.
If CoT is an emergent ability of scale, there is potential to explore how to induce reasoning in smaller models as well.
In Summary — I really like this paper, and think it is massively underrated. You don’t need a supercomputer or clusters of GPUs to test these results yourself. There is no massive technical challenge in the paper, no elaborate algorithm. Yet, it was one of the most influential papers of 2022. Science is all about hypotheses and experiments, and I think this paper is a great representation of that.
You, whoever you are, could have written this paper — you certainly have all the tools the authors had! I think it’s a great starting point for anyone looking to get into reading research papers as well, and I’d recommend giving it a shot.
Until next time!